Abstract: Rényi divergence is related to Rényi entropy much
like Kullback-Leibler divergence is related to Shannon's entropy,
and comes up in many settings. It was introduced by Rényi as a
measure of information that satisfies almost the same axioms as
Kullback-Leibler divergence, and depends on a parameter that
is called its order. In particular, the Rényi divergence of order 1
equals the Kullback-Leibler divergence.

We review and extend the most important properties of Rényi
divergence and Kullback-Leibler divergence, including convexity,
continuity, limits of σ-algebras and the relation of the special
order 0 to the Gaussian dichotomy and contiguity. We also extend
the known equivalence between channel capacity and minimax
redundancy to continuous channel inputs (for all orders), and
present several other minimax results.

Index Terms — channel capacity, Kullback-Leibler divergence, 
minimax redundancy, Renyi divergence 

I. Introduction 

SHANNON entropy and Kullback-Leibler divergence (also 
known as information divergence or relative entropy) are 
perhaps the two most fundamental quantities in information 
theory and its applications. Because of their success, there 
have been many attempts to generalise these concepts, and in 
the literature one will find numerous entropy and divergence 
measures. Most of these quantities have never found any
applications, and almost none of them have found an interpretation
in terms of coding. The most important exceptions are the
Rényi entropy and Rényi divergence [1]. Harremoës [2] and
Grünwald [3, p. 649] provide an operational characterization
of Rényi divergence as the number of bits by which a mixture
of two codes can be compressed; and Csiszár [4] gives an
operational characterization of Rényi divergence as the
cut-off rate in block coding and hypothesis testing.

Rényi divergence appears as a crucial tool in proofs of
convergence of minimum description length and Bayesian
estimators, both in parametric and nonparametric models [5],
[6], and one may recognize it implicitly in many computations
throughout information theory. It is also closely related to
Hellinger distance, which is commonly used in the analysis of
nonparametric density estimation [7]-[9]. Rényi himself used
his divergence to prove the convergence of state probabilities
in a stationary Markov chain to the stationary distribution [1],
and still other applications of Rényi divergence can be found,
for instance, in hypothesis testing [10], in multiple source
adaptation [11] and in ranking of images [12].

Although the closely related Rényi entropy is well studied
[13], [14], the properties of Rényi divergence are scattered
throughout the literature and have often only been established
for finite alphabets. This paper is intended as a reference
document, which treats the most important properties of Rényi
divergence in detail, including Kullback-Leibler divergence as
a special case. Preliminary versions of the results presented
here can be found in [15] and [16]. During the preparation
of this paper, Shayevitz has independently published closely
related work [17], [18].

Tim van Erven (tim@timvanerven.nl) is with the Département de
Mathématiques, Université Paris-Sud, France. Peter Harremoës
(harremoes@ieee.org) is with the Copenhagen Business College,
Denmark. Some of the results in this paper have previously been
presented at the ISIT 2010 conference.



A. Rényi's Information Measures

For finite alphabets, the Rényi divergence of positive order
$\alpha \neq 1$ of a probability distribution $P = (p_1, \ldots, p_n)$ from
another distribution $Q = (q_1, \ldots, q_n)$ is

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\ln\sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha}, \tag{1}$$

where, for $\alpha > 1$, we read $p_i^{\alpha} q_i^{1-\alpha}$ as $p_i^{\alpha}/q_i^{\alpha-1}$ and adopt the
conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$. As described
in Section II, this definition generalises to continuous spaces
by replacing the probabilities by densities and the sum by an
integral. If $P$ and $Q$ are members of the same exponential
family, then their Rényi divergence can be computed using a
formula by Liese and Vajda [19, p. 43], [11]. Gil [20] provides
a long list of examples.
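As a rough illustration of formula (1) and its conventions, the following sketch computes the Rényi divergence of a simple order for two finite distributions given as lists. The function name `renyi_divergence` and the example distributions are assumptions made for illustration only; they do not come from the paper.

```python
import math

def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P||Q) in nats for alpha > 0, alpha != 1.

    p and q are sequences of probabilities on the same finite alphabet.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must satisfy alpha > 0 and alpha != 1")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # convention 0/0 = 0; the term vanishes whenever p_i = 0
        if qi == 0:
            if alpha > 1:
                return math.inf  # convention x/0 = infinity for x > 0
            continue  # for alpha < 1, p_i^alpha * 0^(1 - alpha) = 0
        total += pi ** alpha * qi ** (1 - alpha)
    if total == 0:
        return math.inf  # P and Q have disjoint supports
    return math.log(total) / (alpha - 1)

# Hypothetical example: sum_i p_i^2 / q_i = 1 + 1/3, so D_2 = ln(4/3).
d2 = renyi_divergence([0.5, 0.5], [0.25, 0.75], 2)
```

The early `return math.inf` branches encode the conventions stated after (1); other ways of handling zero probabilities (for example, working with logarithms) would serve equally well.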

Example 1. Let $Q$ be a probability distribution and $A$ a set
with positive probability. Let $P$ be the conditional distribution
of $Q$ given $A$. Then

$$D_\alpha(P\|Q) = -\ln Q(A).$$

We observe that in this important special case the factor $\frac{1}{\alpha-1}$
in the definition of Rényi divergence has the effect that the
value of $D_\alpha(P\|Q)$ does not depend on $\alpha$.
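Example 1 can be checked numerically. The sketch below conditions a hypothetical four-point distribution $Q$ on an event $A$ and evaluates formula (1) at several orders; the distribution, the event and the helper name are illustrative assumptions.

```python
import math

def renyi_divergence(p, q, alpha):
    # Formula (1) for a simple order; terms with p_i = 0 contribute nothing.
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

q = [0.1, 0.2, 0.3, 0.4]
event = [0, 1]                       # hypothetical event A: the first two outcomes
q_of_a = sum(q[i] for i in event)    # Q(A)
p = [q[i] / q_of_a if i in event else 0.0 for i in range(len(q))]  # P = Q(. | A)

# Each entry should equal -ln Q(A), whatever the order.
values = [renyi_divergence(p, q, a) for a in (0.3, 0.5, 2.0, 5.0)]
```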

The Rényi entropy

$$H_\alpha(P) = \frac{1}{1-\alpha}\ln\sum_{i=1}^{n} p_i^{\alpha}$$

can be expressed in terms of the Rényi divergence of $P$ from
the uniform distribution $U = (1/n, \ldots, 1/n)$:

$$H_\alpha(P) = H_\alpha(U) - D_\alpha(P\|U) = \ln n - D_\alpha(P\|U).$$
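The relation between Rényi entropy and the divergence from the uniform distribution is easy to confirm on a small example. In this sketch the helper names and the distribution `p` are assumptions chosen for illustration.

```python
import math

def renyi_entropy(p, alpha):
    # H_alpha(P) = ln(sum_i p_i^alpha) / (1 - alpha)
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1 - alpha)

def renyi_divergence(p, q, alpha):
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

p = [0.5, 0.3, 0.2]
n = len(p)
uniform = [1.0 / n] * n

# Difference between H_alpha(P) and ln n - D_alpha(P||U); should vanish up to rounding.
gaps = [renyi_entropy(p, a) - (math.log(n) - renyi_divergence(p, uniform, a))
        for a in (0.5, 2.0, 3.0)]
```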

As $\alpha$ tends to 1, the Rényi entropy tends to the Shannon
entropy and the Rényi divergence tends to the Kullback-Leibler
divergence, so we recover a well-known relation. The
differential Rényi entropy of a distribution $P$ with density $p$ is
given by

$$h_\alpha(P) = \frac{1}{1-\alpha}\ln\int \big(p(x)\big)^{\alpha}\,dx$$

whenever this integral is defined. If $P$ has support in an
interval $I$ of length $n$ then

$$h_\alpha(P) = \ln n - D_\alpha(P\|U_I),$$

where $U_I$ denotes the uniform distribution on $I$. Thus the
properties of both the Rényi entropy and the differential
Rényi entropy can be deduced from the properties of Rényi
divergence as long as $P$ has compact support.

Fig. 1. Rényi divergence as a function of its order for fixed distributions

There is another way of relating Rényi entropy and Rényi
divergence, in which entropy is considered as self-information.
Let $X$ denote a discrete random variable with distribution $P$,
and let $P_{\mathrm{diag}}$ be the distribution of $(X, X)$. Then

$$H_\alpha(P) = D_{2-\alpha}(P_{\mathrm{diag}}\|P \times P).$$

For $\alpha$ tending to 1, the right-hand side tends to the mutual
information between $X$ and itself, and again a well-known
formula is recovered.

B. Special Orders 

Although one can define the Rényi divergence of any order,
certain values have wider application than others. Of particular
interest are the values 0, 1/2, 1, 2, and $\infty$.

The values 0, 1, and $\infty$ are extended orders in the sense
that Rényi divergence of these orders cannot be calculated by
plugging into (1). Instead, their definitions are determined by
continuity in $\alpha$. (See Figure 1.) This leads to defining Rényi
divergence of order 1 as the Kullback-Leibler divergence.
For order 0 it becomes $-\ln Q(\{i \mid p_i > 0\})$, which is
closely related to absolute continuity and contiguity of the
distributions $P$ and $Q$ (see Section III-F). For order $\infty$, Rényi
divergence is defined as $\ln \max_i \frac{p_i}{q_i}$. In the literature on the
minimum description length principle in statistics, this is called
the worst-case regret of coding with $Q$ rather than with $P$
[3]. The Rényi divergence of order $\infty$ is also related to the
separation distance, used by Aldous and Diaconis [21] to
bound the rate of convergence to the stationary distribution
for certain Markov chains.

Only for $\alpha = 1/2$ is Rényi divergence symmetric in its
arguments. Although not itself a metric, it is a function of the
square of the Hellinger distance $\mathrm{Hel}^2(P,Q) = \sum_{i=1}^{n} \big(\sqrt{p_i} - \sqrt{q_i}\big)^2$:

$$D_{1/2}(P\|Q) = -2\ln\left(1 - \frac{\mathrm{Hel}^2(P,Q)}{2}\right). \tag{2}$$

Similarly, for $\alpha = 2$ it satisfies

$$D_2(P\|Q) = \ln\big(1 + \chi^2(P,Q)\big), \tag{3}$$

where $\chi^2(P,Q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i}$ denotes the $\chi^2$-divergence
[22]. It will be shown that Rényi divergence is nondecreasing
in its order. Therefore, by $\ln t \le t - 1$, (2) and (3) imply that

$$\mathrm{Hel}^2(P,Q) \le D_{1/2}(P\|Q) \le D_1(P\|Q) \le D_2(P\|Q) \le \chi^2(P,Q).$$

Finally, Gilardoni [23] shows that Rényi divergence of orders
$\alpha \in (0, 1]$ is related to the total variation distance $V(P,Q) = \sum_{i=1}^{n} |q_i - p_i|$
by a generalisation of Pinsker's inequality:

$$\frac{\alpha}{2} V^2(P,Q) \le D_\alpha(P\|Q). \tag{4}$$

For $\alpha = 1$ this is the normal version of Pinsker's inequality,
which bounds total variation distance in terms of the square
root of the Kullback-Leibler divergence.
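The identities (2) and (3), the resulting chain of inequalities and the generalised Pinsker inequality (4) can all be verified on a small example. The distributions `p` and `q` below are hypothetical, and the helper treats order 1 as the Kullback-Leibler divergence.

```python
import math

def renyi_divergence(p, q, alpha):
    if alpha == 1:  # Kullback-Leibler divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

hel2 = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))  # squared Hellinger
chi2 = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))                   # chi-squared
tv = sum(abs(pi - qi) for pi, qi in zip(p, q))                            # total variation V

d_half, d_one, d_two = (renyi_divergence(p, q, a) for a in (0.5, 1, 2))

# Hel^2 <= D_{1/2} <= D_1 <= D_2 <= chi^2 should appear in increasing order.
chain = [hel2, d_half, d_one, d_two, chi2]
```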

C. Outline 

The rest of the paper is organized as follows. First, in
Section II, we extend the definition of Rényi divergence
from formula (1) to continuous spaces. One can either define
Rényi divergence via an integral or via discretizations. We
demonstrate that these definitions are equivalent. Then we
show that Rényi divergence extends to the extended orders 0,
1 and $\infty$ in the same way as for finite spaces. Along the way,
we also study its behaviour as a function of $\alpha$. By contrast,
in Section III we study various convexity and continuity
properties of Rényi divergence as a function of $P$ and $Q$,
while $\alpha$ is kept fixed. Section IV contains several minimax
results, and treats the connection to Chernoff information
in hypothesis testing, to which many applications of Rényi
divergence are related. We also discuss the equivalence of
channel capacity and the minimax redundancy for all orders
$\alpha$. Then, in Section V, the main part of the paper is completed
by an extension of Rényi divergence to negative orders. These
are related to the orders $\alpha > 1$ by a negative scaling factor
and a reversal of the arguments $P$ and $Q$. Finally, the appendix
contains a number of negative results, i.e. examples showing
that properties that hold for certain other divergences are
violated by Rényi divergence.

For fixed $\alpha$, Rényi divergence is related to various forms of
power divergences, which are in the well-studied class of
$f$-divergences [24]. Consequently, several of the results we are
presenting for fixed $\alpha$ in Section III are equivalent to known
results about power divergences. To make this presentation
self-contained we avoid the use of such connections and only
use general results from measure theory.

II. Definition of Renyi divergence 

Let us fix the notation to be used throughout the paper. We
consider (probability) measures on a measurable space $(\mathcal{X}, \mathcal{F})$.
If $P$ is a measure on $(\mathcal{X}, \mathcal{F})$, then we write $P_{|\mathcal{G}}$ for its
restriction to the σ-subalgebra $\mathcal{G} \subseteq \mathcal{F}$, which may be interpreted as
the marginal of $P$ on the subset of events $\mathcal{G}$. A measure $P$ is



JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007






called absolutely continuous with respect to another measure
$Q$ if $P(A) = 0$ whenever $Q(A) = 0$ for all events $A \in \mathcal{F}$.
We will write $P \ll Q$ if $P$ is absolutely continuous with
respect to $Q$ and $P \not\ll Q$ otherwise. Alternatively, $P$ and
$Q$ may be mutually singular, denoted $P \perp Q$, which means
that there exists an event $A \in \mathcal{F}$ such that $P(A) = 0$ and
$Q(\mathcal{X} \setminus A) = 0$. We will assume that all (probability) measures
are absolutely continuous with respect to a common σ-finite
measure $\mu$, which is arbitrary in the sense that none of our
definitions or results depend on the choice of $\mu$. As we only
consider (mixtures of) a countable number of distributions,
such a measure $\mu$ exists in all cases, so this is no restriction.
For measures denoted by capital letters (e.g. $P$ or $Q$), we will
use the corresponding lower-case letters (e.g. $p$, $q$) to refer to
their densities with respect to $\mu$. And for any event $A \in \mathcal{F}$, $1_A$
denotes its indicator function, which is 1 on $A$ and 0 otherwise.
Finally, we use the constant $\tau = 2\pi$ to slightly simplify some
expressions, and use the natural logarithm in our definitions,
such that information is measured in nats (1 bit equals $\ln 2$
nats).

We will often need to distinguish between the orders for
which Rényi divergence can be defined by a generalisation of
formula (1) to an integral over densities, and the other orders.
This motivates the following definitions.

Definition 1. We call a (finite) real number $\alpha$ a simple order if
$\alpha > 0$ and $\alpha \neq 1$. The values 0, 1, and $\infty$ are called extended
orders.



A. Definition by Formula for Simple Orders 

Let $P$ and $Q$ be two arbitrary distributions on $(\mathcal{X}, \mathcal{F})$. The
formula in (1), which defines Rényi divergence for simple
orders on finite sample spaces, generalises to arbitrary spaces
as follows:

Definition 2 (Simple Orders). For any simple order $\alpha$, the
Rényi divergence of order $\alpha$ of $P$ from $Q$ is defined as

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\ln\int p^{\alpha} q^{1-\alpha}\,d\mu, \tag{5}$$

where, for $\alpha > 1$, we read $p^{\alpha} q^{1-\alpha}$ as $p^{\alpha}/q^{\alpha-1}$ and adopt the
conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$.



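For example, for any simple order $\alpha$, the Rényi divergence
of a normal distribution (with mean $\mu_0$ and positive variance
$\sigma_0^2$) from another normal distribution (with mean $\mu_1$ and
positive variance $\sigma_1^2$) is

$$D_\alpha\big(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\big) = \frac{\alpha(\mu_1-\mu_0)^2}{2\sigma_\alpha^2} + \frac{1}{1-\alpha}\ln\frac{\sigma_\alpha}{\sigma_0^{1-\alpha}\sigma_1^{\alpha}}, \tag{6}$$

provided that $\sigma_\alpha^2 = (1-\alpha)\sigma_0^2 + \alpha\sigma_1^2 > 0$ [19, p. 45].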
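One way to gain confidence in the closed form (6) is to compare it with a direct numerical evaluation of the integral in Definition 2. The sketch below uses a plain midpoint rule on a truncated interval; the function names, the integration range and the parameter values are assumptions for illustration, not part of the paper.

```python
import math

def renyi_gaussian(mu0, sigma0, mu1, sigma1, alpha):
    """Closed form (6) for D_alpha(N(mu0, sigma0^2) || N(mu1, sigma1^2))."""
    var_alpha = (1 - alpha) * sigma0 ** 2 + alpha * sigma1 ** 2
    if var_alpha <= 0:
        raise ValueError("formula (6) requires sigma_alpha^2 > 0")
    mean_term = alpha * (mu1 - mu0) ** 2 / (2 * var_alpha)
    ratio = math.sqrt(var_alpha) / (sigma0 ** (1 - alpha) * sigma1 ** alpha)
    return mean_term + math.log(ratio) / (1 - alpha)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def renyi_numeric(mu0, sigma0, mu1, sigma1, alpha, lo=-30.0, hi=30.0, steps=100_000):
    """Midpoint-rule approximation of the integral in (5); a crude check only."""
    h = (hi - lo) / steps
    integral = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        # p^alpha * q^(1 - alpha), evaluated in log space to avoid overflow
        integral += math.exp(alpha * log_normal_pdf(x, mu0, sigma0)
                             + (1 - alpha) * log_normal_pdf(x, mu1, sigma1))
    return math.log(integral * h) / (alpha - 1)

closed = renyi_gaussian(0.0, 1.0, 1.0, 2.0, 2.0)
numeric = renyi_numeric(0.0, 1.0, 1.0, 2.0, 2.0)
```

For these parameters the integral can also be done by hand by completing the square, which gives $D_2 = \ln 4 - \tfrac{1}{2}\ln 7 + \tfrac{1}{7}$.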

The interpretation of $p^{\alpha} q^{1-\alpha}$ in Definition 2 is such that the
Hellinger integral $\int p^{\alpha} q^{1-\alpha}\,d\mu$ is an $f$-divergence [24], which
ensures that the relations from the introduction to squared
Hellinger distance (2), $\chi^2$-distance (3), and the total variation
distance hold in general, not just for finite sample spaces.



For simple orders, we may always change to integration
with respect to $P$:

$$\int p^{\alpha} q^{1-\alpha}\,d\mu = \int \left(\frac{q}{p}\right)^{1-\alpha} dP,$$

which shows that our definition does not depend on the choice
of dominating measure $\mu$. In most cases it is also equivalent
to integrate with respect to $Q$:

$$\int p^{\alpha} q^{1-\alpha}\,d\mu = \int \left(\frac{p}{q}\right)^{\alpha} dQ \qquad (0 < \alpha < 1 \text{ or } P \ll Q).$$

However, if $\alpha > 1$ and $P \not\ll Q$, then $D_\alpha(P\|Q) = \infty$,
whereas the integral with respect to $Q$ may be finite.



B. Definition via Discretization for Simple Orders 

We shall repeatedly use the following result, which is a
direct consequence of the Radon-Nikodým theorem [25]:

Proposition 1. Suppose $\lambda \ll \mu$ is a probability distribution,
or any countably additive measure such that $\lambda(\mathcal{X}) \le 1$. Then
for any σ-subalgebra $\mathcal{G} \subseteq \mathcal{F}$

$$\frac{d\lambda_{|\mathcal{G}}}{d\mu_{|\mathcal{G}}} = E\left[\frac{d\lambda}{d\mu}\,\middle|\,\mathcal{G}\right] \qquad (\mu\text{-a.s.})$$

It has been argued that grouping observations together (by
considering a coarser σ-algebra) should not increase our
ability to distinguish between $P$ and $Q$ under any measure
of divergence [26]. This is expressed by the data processing
inequality, which Rényi divergence satisfies:

Theorem 1 (Data Processing Inequality). For any simple
order $\alpha$ and any σ-subalgebra $\mathcal{G} \subseteq \mathcal{F}$

$$D_\alpha(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) \le D_\alpha(P\|Q).$$

Theorem 9 below shows that the data processing inequality
also holds for the extended orders.
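On a finite sample space, restricting to a σ-subalgebra amounts to merging outcomes into the cells of a partition, so the data processing inequality can be illustrated directly. The helper `coarsen`, the distributions and the partition below are hypothetical choices.

```python
import math

def renyi_divergence(p, q, alpha):
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

def coarsen(dist, cells):
    """Restrict a finite distribution to the sigma-algebra generated by a partition."""
    return [sum(dist[i] for i in cell) for cell in cells]

p = [0.1, 0.4, 0.2, 0.3]
q = [0.25, 0.25, 0.25, 0.25]
cells = [(0, 2), (1, 3)]  # hypothetical partition of {0, 1, 2, 3} into two events

# (divergence after coarsening, divergence before coarsening) for a few simple orders
pairs = [(renyi_divergence(coarsen(p, cells), coarsen(q, cells), a),
          renyi_divergence(p, q, a)) for a in (0.5, 2.0, 10.0)]
```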

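Proof: Let $\tilde{P}$ denote the absolutely continuous component
of $P$ with respect to $Q$. Then by Proposition 1 and Jensen's
inequality for conditional expectations

$$\frac{1}{\alpha-1}\ln\int\left(\frac{d\tilde{P}_{|\mathcal{G}}}{dQ_{|\mathcal{G}}}\right)^{\alpha} dQ
= \frac{1}{\alpha-1}\ln\int\left(E\left[\frac{d\tilde{P}}{dQ}\,\middle|\,\mathcal{G}\right]\right)^{\alpha} dQ
\le \frac{1}{\alpha-1}\ln\int E\left[\left(\frac{d\tilde{P}}{dQ}\right)^{\alpha}\,\middle|\,\mathcal{G}\right] dQ
= \frac{1}{\alpha-1}\ln\int\left(\frac{d\tilde{P}}{dQ}\right)^{\alpha} dQ. \tag{7}$$

If $0 < \alpha < 1$, then $p^{\alpha} q^{1-\alpha} = 0$ if $q = 0$, so the restriction of
$P$ to $\tilde{P}$ does not change the Rényi divergence, and hence the
theorem is proved. Alternatively, suppose $\alpha > 1$. If $P \ll Q$,
then $\tilde{P} = P$ and the theorem again follows from (7). If $P \not\ll Q$,
then $D_\alpha(P\|Q) = \infty$ and the theorem holds as well. ■

The next theorem shows that if $\mathcal{X}$ is a continuous space,
then the Rényi divergence on $\mathcal{X}$ can be arbitrarily well
approximated by the Rényi divergence on finite partitions of
$\mathcal{X}$. For any finite or countable partition $\mathcal{P} = \{A_1, A_2, \ldots\}$ of
$\mathcal{X}$, let $P_{|\mathcal{P}} \equiv P_{|\sigma(\mathcal{P})}$ and $Q_{|\mathcal{P}} \equiv Q_{|\sigma(\mathcal{P})}$ denote the restrictions
of $P$ and $Q$ to the σ-algebra generated by $\mathcal{P}$.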



Theorem 2. For any simple order $\alpha$

$$D_\alpha(P\|Q) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}), \tag{8}$$

where the supremum is over all finite partitions $\mathcal{P} \subseteq \mathcal{F}$.

It follows that it would be equivalent to first define Rényi
divergence for finite sample spaces and then extend the
definition to arbitrary sample spaces using (8).

The identity (8) also holds for the extended orders 1 and
$\infty$. (See Theorem 10 below.)

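Proof of Theorem 2: By the data processing inequality

$$\sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) \le D_\alpha(P\|Q).$$

To show the converse inequality, consider for any $\varepsilon > 0$ a
discretisation of the densities $p$ and $q$ into a countable number
of bins

$$B^{\varepsilon}_{m,n} = \{x \in \mathcal{X} \mid e^{m\varepsilon} \le p(x) < e^{(m+1)\varepsilon},\; e^{n\varepsilon} \le q(x) < e^{(n+1)\varepsilon}\},$$

where $n, m \in \{-\infty, \ldots, -1, 0, 1, \ldots\}$. Let $\mathcal{Q}_\varepsilon = \{B^{\varepsilon}_{m,n}\}$
and $\mathcal{F}_\varepsilon = \sigma(\mathcal{Q}_\varepsilon) \subseteq \mathcal{F}$ be the corresponding partition and
σ-algebra, and let $p_\varepsilon = dP_{|\mathcal{Q}_\varepsilon}/d\mu$ and $q_\varepsilon = dQ_{|\mathcal{Q}_\varepsilon}/d\mu$ be the
densities of $P$ and $Q$ restricted to $\mathcal{F}_\varepsilon$. Then by Proposition 1

$$\frac{q_\varepsilon}{p_\varepsilon} = \frac{E[q \mid \mathcal{F}_\varepsilon]}{E[p \mid \mathcal{F}_\varepsilon]} \le e^{2\varepsilon}\,\frac{q}{p} \qquad (P\text{-a.s.})$$

It follows that

$$\frac{1}{\alpha-1}\ln\int\left(\frac{q_\varepsilon}{p_\varepsilon}\right)^{1-\alpha} dP \ge \frac{1}{\alpha-1}\ln\int\left(\frac{q}{p}\right)^{1-\alpha} dP - 2\varepsilon,$$

and hence the supremum over all countable partitions is large
enough:

$$\sup_{\substack{\text{countable } \mathcal{Q} \\ \sigma(\mathcal{Q}) \subseteq \mathcal{F}}} D_\alpha(P_{|\mathcal{Q}}\|Q_{|\mathcal{Q}}) \ge \sup_{\varepsilon > 0} D_\alpha(P_{|\mathcal{Q}_\varepsilon}\|Q_{|\mathcal{Q}_\varepsilon}) \ge D_\alpha(P\|Q).$$

It remains to show that the supremum over finite partitions is
at least as large. To this end, suppose $\mathcal{Q} = \{B_1, B_2, \ldots\}$ is any
countable partition and let $\mathcal{P}_n = \{B_1, \ldots, B_{n-1}, \bigcup_{i \ge n} B_i\}$.
Then by

$$P\Big(\bigcup_{i \ge n} B_i\Big)^{\alpha} Q\Big(\bigcup_{i \ge n} B_i\Big)^{1-\alpha} \ge 0 \qquad (\alpha > 1),$$

$$\lim_{n\to\infty} P\Big(\bigcup_{i \ge n} B_i\Big)^{\alpha} Q\Big(\bigcup_{i \ge n} B_i\Big)^{1-\alpha} = 0 \qquad (0 < \alpha < 1),$$

we find that

$$\lim_{n\to\infty} D_\alpha(P_{|\mathcal{P}_n}\|Q_{|\mathcal{P}_n}) = \lim_{n\to\infty} \frac{1}{\alpha-1}\ln\sum_{B \in \mathcal{P}_n} P(B)^{\alpha} Q(B)^{1-\alpha}
\ge \lim_{n\to\infty} \frac{1}{\alpha-1}\ln\sum_{i=1}^{n-1} P(B_i)^{\alpha} Q(B_i)^{1-\alpha}
= D_\alpha(P_{|\mathcal{Q}}\|Q_{|\mathcal{Q}}),$$

where the inequality holds with equality if $0 < \alpha < 1$. ■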



C. Extended Orders: Varying the Order 

As for finite alphabets, continuity considerations lead to the
following extensions of Rényi divergence to orders for which
it cannot be defined using the formula in (5).

Definition 3 (Extended Orders). The Rényi divergences of
orders 0 and 1 are defined as

$$D_0(P\|Q) = \lim_{\alpha \downarrow 0} D_\alpha(P\|Q),$$

$$D_1(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q),$$

and of order $\infty$ as

$$D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} D_\alpha(P\|Q).$$

Our definition of $D_0$ follows Csiszár [4]. It differs from
Rényi's original definition [1], which uses (5) with $\alpha = 0$
plugged in and is therefore always zero. As illustrated by
Section III-F, the present definition is more interesting.

The limits in Definition 3 always exist, because Rényi
divergence is nondecreasing in its order:

Theorem 3 (Increasing in the Order). For $\alpha \in [0, \infty]$ the
Rényi divergence $D_\alpha(P\|Q)$ is nondecreasing in $\alpha$. On
$\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$ it is constant
if and only if $P$ is the conditional distribution $Q(\cdot \mid A)$ for
some event $A \in \mathcal{F}$.

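Proof: Let $\alpha < \beta$ be simple orders. Then for $x \ge 0$ the
function $x \mapsto x^{\frac{\alpha-1}{\beta-1}}$ is strictly convex if $\alpha < 1$ and strictly
concave if $\alpha > 1$. Therefore by Jensen's inequality

$$\frac{1}{\alpha-1}\ln\int p^{\alpha} q^{1-\alpha}\,d\mu = \frac{1}{\alpha-1}\ln\int\left(\frac{q}{p}\right)^{(1-\beta)\frac{\alpha-1}{\beta-1}} dP \le \frac{1}{\beta-1}\ln\int\left(\frac{q}{p}\right)^{1-\beta} dP.$$

On $\mathcal{A}$, $\int (q/p)^{1-\beta}\,dP$ is finite. As a consequence, Jensen's
inequality holds with equality if and only if $(q/p)^{1-\beta}$ is constant
$P$-a.s., which is equivalent to $q/p$ being constant $P$-a.s., which
in turn means that $P = Q(\cdot \mid A)$ for some event $A$.

From the simple orders, the result extends to the extended
orders by the following observations:

$$D_0(P\|Q) = \inf_{0 < \alpha < 1} D_\alpha(P\|Q),$$

$$D_1(P\|Q) = \sup_{0 < \alpha < 1} D_\alpha(P\|Q) \le \inf_{\alpha > 1} D_\alpha(P\|Q),$$

$$D_\infty(P\|Q) = \sup_{\alpha > 1} D_\alpha(P\|Q). \qquad ■$$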
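Theorem 3 can be illustrated on a finite alphabet by evaluating the divergence on a grid of orders that includes the extended orders 0, 1 and $\infty$, using their closed forms for finite alphabets. The grid, the distributions and the helper below are assumptions made for this sketch; the helper assumes $q_i > 0$ for all $i$.

```python
import math

def renyi_divergence(p, q, alpha):
    """All orders in [0, inf] on a finite alphabet, assuming q_i > 0 everywhere."""
    support = [(pi, qi) for pi, qi in zip(p, q) if pi > 0]
    if alpha == 0:
        return -math.log(sum(qi for _, qi in support))            # -ln Q(p > 0)
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in support)  # Kullback-Leibler
    if alpha == math.inf:
        return math.log(max(pi / qi for pi, qi in support))       # ln max p_i / q_i
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in support)
    return math.log(total) / (alpha - 1)

orders = [0, 0.01, 0.5, 0.99, 1, 1.01, 2, 10, 100, math.inf]

p, q = [0.5, 0.5, 0.0], [0.2, 0.5, 0.3]
values = [renyi_divergence(p, q, a) for a in orders]  # should be nondecreasing

# P = Q(. | A) with A the first two outcomes: the divergence should not depend on the order.
p_cond = [0.2 / 0.7, 0.5 / 0.7, 0.0]
flat = [renyi_divergence(p_cond, q, a) for a in orders]
```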



Let us verify that the limits in Definition 3 can be expressed
in closed form, just like for finite alphabets. We require the
following lemma:

Lemma 1. For any sequence $\alpha_1, \alpha_2, \ldots \in \mathcal{A}$ such that
$\alpha_n \to \beta \in \mathcal{A} \cup \{0, 1\}$

$$\lim_{n\to\infty} \int p^{\alpha_n} q^{1-\alpha_n}\,d\mu = \int \lim_{n\to\infty} p^{\alpha_n} q^{1-\alpha_n}\,d\mu. \tag{9}$$

Our proof extends a proof by Shiryaev [25, pp. 366-367].
Proof: We will verify the conditions for the dominated
convergence theorem [25], from which (9) follows. First
suppose $0 \le \beta < 1$. Then $0 < \alpha_n < 1$ for all sufficiently
large $n$. In this case $p^{\alpha_n} q^{1-\alpha_n}$, which is never negative, does
not exceed $\alpha_n p + (1 - \alpha_n) q \le p + q$, and the dominated
convergence theorem applies because $\int (p + q)\,d\mu = 2 < \infty$.
Secondly, suppose $\beta \ge 1$ and assume without loss of
generality that $\alpha_n > 0$. Then there exists a $\gamma \ge \beta$ such that
$\gamma \in \mathcal{A} \cup \{1\}$ and $\alpha_n \le \gamma$ for all sufficiently large $n$. If $\gamma = 1$,
then $\alpha_n < 1$ and we are done by the same argument as above.
So suppose $\gamma > 1$. Then convexity of $p^{\alpha_n} q^{1-\alpha_n}$ in $\alpha_n$ implies
that for $\alpha_n \le \gamma$

$$p^{\alpha_n} q^{1-\alpha_n} \le \left(1 - \frac{\alpha_n}{\gamma}\right) p^{0} q^{1} + \frac{\alpha_n}{\gamma}\, p^{\gamma} q^{1-\gamma} \le q + p^{\gamma} q^{1-\gamma}.$$

Since $\int q\,d\mu = 1$, it remains to show that $\int p^{\gamma} q^{1-\gamma}\,d\mu < \infty$,
which is implied by $\gamma > 1$ and $D_\gamma(P\|Q) < \infty$. ■

The closed-form expression for $\alpha = 0$ follows immediately:

Theorem 4 ($\alpha = 0$).

$$D_0(P\|Q) = -\ln Q(p > 0).$$

Proof of Theorem 4: By Lemma 1 and the fact that
$\lim_{\alpha \downarrow 0} p^{\alpha} q^{1-\alpha} = 1_{\{p > 0\}}\, q$. ■
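Theorem 4 can be observed numerically: for a distribution $P$ that does not charge every outcome, the divergence of a small simple order approaches $-\ln Q(p > 0)$. The distributions and the helper name in this sketch are illustrative assumptions.

```python
import math

def renyi_divergence(p, q, alpha):
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

p, q = [0.5, 0.5, 0.0], [0.2, 0.5, 0.3]
closed_form = -math.log(sum(qi for pi, qi in zip(p, q) if pi > 0))  # -ln Q(p > 0)

# Distance to the closed form as alpha decreases towards 0.
gaps = [abs(renyi_divergence(p, q, a) - closed_form) for a in (1e-2, 1e-4, 1e-6)]
```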
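For $\alpha = 1$, the limit in Definition 3 equals the Kullback-Leibler
divergence of $P$ from $Q$, which is defined as

$$D(P\|Q) = \int p \ln\frac{p}{q}\,d\mu,$$

with the conventions that $0 \ln(0/q) = 0$ and $p \ln(p/0) = \infty$ if
$p > 0$. Consequently, $D(P\|Q) = \infty$ if $P \not\ll Q$.

Theorem 5 ($\alpha = 1$).

$$D_1(P\|Q) = D(P\|Q). \tag{10}$$

Moreover, if $D(P\|Q) = \infty$ or there exists a $\beta > 1$ such that
$D_\beta(P\|Q) < \infty$, then also

$$\lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = D(P\|Q). \tag{11}$$

For example, by letting $\alpha \uparrow 1$ in (6) or by direct
computation, it can be derived [19] that the Kullback-Leibler
divergence between two normal distributions with positive
variance is

$$D\big(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\big) = \frac{1}{2}\left(\frac{(\mu_1-\mu_0)^2}{\sigma_1^2} + \ln\frac{\sigma_1^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma_1^2} - 1\right).$$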
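For normal distributions both limits in Theorem 5 can be checked against the closed forms: the Rényi divergence (6) evaluated just below and just above order 1 should be close to the Kullback-Leibler expression. The function names and parameter values below are assumptions for illustration.

```python
import math

def renyi_gaussian(mu0, sigma0, mu1, sigma1, alpha):
    # Closed form (6); requires (1 - alpha) * sigma0^2 + alpha * sigma1^2 > 0.
    var_alpha = (1 - alpha) * sigma0 ** 2 + alpha * sigma1 ** 2
    ratio = math.sqrt(var_alpha) / (sigma0 ** (1 - alpha) * sigma1 ** alpha)
    return alpha * (mu1 - mu0) ** 2 / (2 * var_alpha) + math.log(ratio) / (1 - alpha)

def kl_gaussian(mu0, sigma0, mu1, sigma1):
    # Kullback-Leibler divergence between two normal distributions.
    return 0.5 * ((mu1 - mu0) ** 2 / sigma1 ** 2
                  + math.log(sigma1 ** 2 / sigma0 ** 2)
                  + sigma0 ** 2 / sigma1 ** 2 - 1)

kl = kl_gaussian(0.0, 1.0, 1.0, 2.0)
below = renyi_gaussian(0.0, 1.0, 1.0, 2.0, 1 - 1e-6)  # alpha slightly below 1
above = renyi_gaussian(0.0, 1.0, 1.0, 2.0, 1 + 1e-6)  # alpha slightly above 1
```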



It is possible that $D_\alpha(P\|Q) = \infty$ for all $\alpha > 1$, but
$D(P\|Q) < \infty$, such that (11) does not hold. This situation
occurs, for example, if $P$ is doubly exponential on $\mathcal{X} = \mathbb{R}$ with
density $p(x) = e^{-2|x|}$ and $Q$ is standard normal with density
$q(x) = e^{-x^2/2}/\sqrt{\tau}$, where $\tau = 2\pi$. (Liese and Vajda [24] have
previously used these distributions in a similar example.) In
this case there is no way to make Rényi divergence continuous
in $\alpha$ at $\alpha = 1$, and we opt to define $D_1$ as the limit from below,
such that it always equals the Kullback-Leibler divergence.

The proof of Theorem 5 requires an intermediate lemma:

Lemma 2. For any $x > 1/2$

$$(x - 1)\left(1 + \frac{1 - x}{2}\right) \le \ln x \le x - 1.$$

Proof: By Taylor's theorem with Cauchy's remainder
term we have for any positive $x$ that

$$\ln x = x - 1 - \frac{(x - \xi)(x - 1)}{2\xi^2} = (x - 1)\left(1 + \frac{\xi - x}{2\xi^2}\right)$$

for some $\xi$ between $x$ and 1. As $\frac{\xi - x}{2\xi^2}$ is increasing in $\xi$ for
$x > 1/2$, the lemma follows. ■
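Proof of Theorem 5: Suppose $P \not\ll Q$. Then $D(P\|Q) = \infty = D_\beta(P\|Q)$
for all $\beta > 1$, so (11) holds. And (10) follows by

$$\lim_{\alpha \uparrow 1} \frac{1}{\alpha-1}\ln\int p^{\alpha} q^{1-\alpha}\,d\mu \ge \lim_{\alpha \uparrow 1} \frac{1}{\alpha-1}\ln\int \big(1_{\{q>0\}}\, p\big)^{\alpha} q^{1-\alpha}\,d\mu
\ge \lim_{\alpha \uparrow 1} \frac{\alpha}{\alpha-1}\ln P(q > 0) = \infty = D(P\|Q),$$

where the second inequality is Jensen's. Alternatively, suppose
$P \ll Q$ and let $x_\alpha = \int p^{\alpha} q^{1-\alpha}\,d\mu$. Then $\lim_{\alpha \uparrow 1} x_\alpha = 1$ by
Lemma 1. Therefore Lemma 2 implies that

$$\lim_{\alpha \uparrow 1} D_\alpha(P\|Q) = \lim_{\alpha \uparrow 1} \frac{1}{\alpha-1}\ln x_\alpha = \lim_{\alpha \uparrow 1} \frac{x_\alpha - 1}{\alpha - 1} = \lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^{\alpha} q^{1-\alpha}}{1-\alpha}\,d\mu, \tag{12}$$

where the restriction of the domain of integration is allowed
because $q = 0$ implies $p = 0$ ($\mu$-a.s.) by $P \ll Q$. Convexity
of $p^{\alpha} q^{1-\alpha}$ in $\alpha$ implies that its derivative, $p^{\alpha} q^{1-\alpha} \ln\frac{p}{q}$, is
nondecreasing and therefore for $p, q > 0$

$$\frac{p - p^{\alpha} q^{1-\alpha}}{1-\alpha} = \frac{1}{1-\alpha}\int_{\alpha}^{1} p^{z} q^{1-z} \ln\frac{p}{q}\,dz$$

is nondecreasing in $\alpha$, and $\frac{p - p^{\alpha} q^{1-\alpha}}{1-\alpha} \ge \frac{p - p^{0} q^{1}}{1 - 0} = p - q$.
As $\int_{p,q>0} (p - q)\,d\mu > -\infty$, it follows by the monotone
convergence theorem that

$$\lim_{\alpha \uparrow 1} \int_{p,q>0} \frac{p - p^{\alpha} q^{1-\alpha}}{1-\alpha}\,d\mu = \int_{p,q>0} \lim_{\alpha \uparrow 1} \frac{p - p^{\alpha} q^{1-\alpha}}{1-\alpha}\,d\mu = \int_{p,q>0} p \ln\frac{p}{q}\,d\mu = D(P\|Q),$$

which together with (12) proves (10). If $D(P\|Q) = \infty$, then
$D_\beta(P\|Q) \ge D(P\|Q) = \infty$ for all $\beta > 1$ and (11) holds.
It remains to prove (11) if there exists a $\beta > 1$ such that
$D_\beta(P\|Q) < \infty$. In this case, arguments similar to the ones
above imply that

$$\lim_{\alpha \downarrow 1} D_\alpha(P\|Q) = \lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^{\alpha} q^{1-\alpha} - p}{\alpha - 1}\,d\mu \tag{13}$$

and $\frac{p^{\alpha} q^{1-\alpha} - p}{\alpha - 1}$ is nondecreasing in $\alpha$. Therefore
$\frac{p^{\alpha} q^{1-\alpha} - p}{\alpha - 1} \le \frac{p^{\beta} q^{1-\beta} - p}{\beta - 1} \le \frac{p^{\beta} q^{1-\beta}}{\beta - 1}$
and, as $\int_{p,q>0} \frac{p^{\beta} q^{1-\beta}}{\beta - 1}\,d\mu < \infty$ is implied
by $D_\beta(P\|Q) < \infty$, it follows by the monotone convergence
theorem that

$$\lim_{\alpha \downarrow 1} \int_{p,q>0} \frac{p^{\alpha} q^{1-\alpha} - p}{\alpha - 1}\,d\mu = \int_{p,q>0} \lim_{\alpha \downarrow 1} \frac{p^{\alpha} q^{1-\alpha} - p}{\alpha - 1}\,d\mu = \int_{p,q>0} p \ln\frac{p}{q}\,d\mu = D(P\|Q),$$

which together with (13) completes the proof. ■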
For any random variable $X$, the essential supremum of $X$
with respect to $P$ is $\operatorname{ess\,sup}_P X = \inf\{c \mid P(X > c) = 0\}$.

Theorem 6 ($\alpha = \infty$).

$$D_\infty(P\|Q) = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln\left(\operatorname*{ess\,sup}_{P} \frac{p}{q}\right),$$

with the conventions that $0/0 = 0$ and $x/0 = \infty$ if $x > 0$.

If the sample space $\mathcal{X}$ is countable, then with the
notational conventions of this theorem the essential supremum
reduces to an ordinary supremum, and we have
$D_\infty(P\|Q) = \ln \sup_x \frac{P(x)}{Q(x)}$.

Proof: If $\mathcal{X}$ contains a finite number of elements $n$, then

$$D_\infty(P\|Q) = \lim_{\alpha \uparrow \infty} \frac{1}{\alpha-1}\ln\sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha} = \ln \max_i \frac{p_i}{q_i} = \ln \max_{A \subseteq \mathcal{X}} \frac{P(A)}{Q(A)}.$$

This extends to arbitrary measurable spaces $(\mathcal{X}, \mathcal{F})$ by Theorem 2:

$$D_\infty(P\|Q) = \sup_{\alpha < \infty} \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) = \sup_{\mathcal{P}} \sup_{\alpha < \infty} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}})
= \sup_{\mathcal{P}} \ln \max_{A \in \mathcal{P}} \frac{P(A)}{Q(A)} = \ln \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)},$$

where $\mathcal{P}$ ranges over all finite partitions in $\mathcal{F}$.

Now if $P \not\ll Q$, then there exists an event $B \in \mathcal{F}$ such that
$P(B) > 0$ but $Q(B) = 0$, and

$$P\left(\frac{p}{q} = \infty\right) = P(q = 0) \ge P(B) > 0$$

implies that $\operatorname{ess\,sup} p/q = \infty = \sup_A P(A)/Q(A)$. Alternatively,
suppose that $P \ll Q$. Then

$$P(A) = \int_{A \cap \{q > 0\}} p\,d\mu \le \int_{A \cap \{q > 0\}} \operatorname{ess\,sup}\frac{p}{q} \cdot q\,d\mu = \operatorname{ess\,sup}\frac{p}{q} \cdot Q(A)$$

for all $A \in \mathcal{F}$ and it follows that

$$\sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \le \operatorname{ess\,sup}\frac{p}{q}. \tag{14}$$

Let $a < \operatorname{ess\,sup} p/q$ be arbitrary. Then there exists a set $A \in \mathcal{F}$
with $P(A) > 0$ such that $p/q > a$ on $A$ and therefore

$$P(A) = \int_A p\,d\mu \ge \int_A a \cdot q\,d\mu = a \cdot Q(A).$$

Thus $\sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge a$ for any $a < \operatorname{ess\,sup} \frac{p}{q}$, which implies
that

$$\sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} \ge \operatorname{ess\,sup}\frac{p}{q}.$$

In combination with (14) this completes the proof. ■

Taken together, the previous results imply that Rényi
divergence is a continuous function of its order $\alpha$ (under suitable
conditions):

Theorem 7 (Continuity in the Order). The Rényi divergence
$D_\alpha(P\|Q)$ is continuous in $\alpha$ on
$\mathcal{A} = \{\alpha \in [0, \infty] \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}$.

Proof: Continuity at any simple order $\beta$ follows by
Lemma 1. It extends to the extended orders 0 and $\infty$ by the
definition of Rényi divergence at these orders. And it extends
to $\alpha = 1$ by Theorem 5. ■

III. Fixed Nonnegative Orders

In this section we fix the order $\alpha$ and study properties
of Rényi divergence as $P$ and $Q$ are varied. First we prove
nonnegativity and extend the data processing inequality and
the relation to a supremum over finite partitions to extended
orders. Then we consider various convexity and continuity
properties.

A. Positivity, Data Processing and Finite Partitions

Theorem 8 (Positivity). For any order $\alpha \in [0, \infty]$

$$D_\alpha(P\|Q) \ge 0.$$

For $\alpha > 0$, $D_\alpha(P\|Q) = 0$ if and only if $P = Q$. For $\alpha = 0$,
$D_\alpha(P\|Q) = 0$ if and only if $Q \ll P$.

Proof: Suppose first that $\alpha$ is a simple order. Then by
Jensen's inequality

$$\frac{1}{\alpha-1}\ln\int p^{\alpha} q^{1-\alpha}\,d\mu = \frac{1}{\alpha-1}\ln\int\left(\frac{q}{p}\right)^{1-\alpha} dP \ge \frac{1-\alpha}{\alpha-1}\ln\int\frac{q}{p}\,dP \ge 0.$$

Equality holds if and only if $q/p$ is constant $P$-a.s. (first
inequality) and $Q \ll P$ (second inequality), which together
is equivalent to $P = Q$.

The result extends to $\alpha \in \{1, \infty\}$ by $D_\alpha(P\|Q) = \sup_{\beta < \alpha} D_\beta(P\|Q)$.
For $\alpha = 0$ it can be verified directly that
$-\ln Q(p > 0) \ge 0$, with equality if and only if $Q \ll P$. ■

Theorem 9 (Data Processing Inequality). For any order
$\alpha \in [0, \infty]$ and any σ-subalgebra $\mathcal{G} \subseteq \mathcal{F}$

$$D_\alpha(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) \le D_\alpha(P\|Q). \tag{15}$$

Proof: By Theorem 1, (15) holds for the simple orders.
Let $\beta$ be any extended order and let $\alpha_n \to \beta$ be an arbitrary
sequence of simple orders that converges to $\beta$, from above if
$\beta = 0$ and from below if $\beta \in \{1, \infty\}$. Then

$$D_\beta(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) = \lim_{n\to\infty} D_{\alpha_n}(P_{|\mathcal{G}}\|Q_{|\mathcal{G}}) \le \lim_{n\to\infty} D_{\alpha_n}(P\|Q) = D_\beta(P\|Q). \qquad ■$$

Theorem 10. For any $\alpha \in (0, \infty]$

$$D_\alpha(P\|Q) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}),$$

where the supremum is over all finite partitions $\mathcal{P} \subseteq \mathcal{F}$.

Proof: For simple orders $\alpha$, the result holds by
Theorem 2. This extends to $\alpha \in \{1, \infty\}$ by monotonicity and
left-continuity in $\alpha$:

$$D_\alpha(P\|Q) = \sup_{\beta < \alpha} D_\beta(P\|Q) = \sup_{\beta < \alpha} \sup_{\mathcal{P}} D_\beta(P_{|\mathcal{P}}\|Q_{|\mathcal{P}})
= \sup_{\mathcal{P}} \sup_{\beta < \alpha} D_\beta(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}) = \sup_{\mathcal{P}} D_\alpha(P_{|\mathcal{P}}\|Q_{|\mathcal{P}}). \qquad ■$$

Fig. 2. Rényi divergence as a function of $P = (p, 1-p)$ for $Q = (1/3, 2/3)$

Fig. 3. Level curves of $D_{1/2}(P\|Q)$ for fixed $Q$ as $P$ ranges over the
simplex of distributions on a three-element set
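For finite alphabets, the closed form of Theorem 6 and the positivity statement of Theorem 8 can be checked with a few lines of code. The sketch below compares finite orders with $\ln \max_i p_i/q_i$; the distributions and the chosen orders are hypothetical.

```python
import math

def renyi_divergence(p, q, alpha):
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

p, q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]

d_inf = math.log(max(pi / qi for pi, qi in zip(p, q)))  # ln max_i p_i / q_i

# Divergences of increasing finite order approach d_inf from below.
finite_orders = [renyi_divergence(p, q, a) for a in (2, 20, 200)]

# Positivity: the divergence of a distribution from itself is zero.
self_divergence = renyi_divergence(q, q, 2)
```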



B. Convexity 

Consider Figures 2 and 3. They show $D_\alpha(P\|Q)$ as a function
of $P$ for sample spaces containing two or three elements.
These figures suggest that Rényi divergence is convex in its
first argument for small $\alpha$, but not for large $\alpha$. This is in
agreement with the well-known fact that it is jointly convex
in the pair $(P, Q)$ for $\alpha = 1$. It turns out that joint convexity
extends to $\alpha < 1$, but not to $\alpha > 1$, as noted by Csiszár
[4]. Our proof generalises the proof for $\alpha = 1$ by Cover and
Thomas [27].

Theorem 11. For any order $\alpha \in [0, 1]$ Rényi divergence is
jointly convex in its arguments. That is, for any two pairs
of probability distributions $(P_0, Q_0)$ and $(P_1, Q_1)$, and any
$0 < \lambda < 1$

$$D_\alpha\big((1-\lambda)P_0 + \lambda P_1 \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big) \le (1-\lambda) D_\alpha(P_0\|Q_0) + \lambda D_\alpha(P_1\|Q_1). \tag{16}$$

Equality holds if and only if

$\alpha = 0$: $D_0(P_0\|Q_0) = D_0(P_1\|Q_1)$,
    $p_0 = 0 \Rightarrow p_1 = 0$ ($Q_0$-a.s.) and
    $p_1 = 0 \Rightarrow p_0 = 0$ ($Q_1$-a.s.);
$0 < \alpha < 1$: $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$ and
    $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.);
$\alpha = 1$: $p_0 q_1 = p_1 q_0$ ($\mu$-a.s.)

Proof: Suppose first that $\alpha = 0$, and let $P_\lambda = (1-\lambda)P_0 + \lambda P_1$
and $Q_\lambda = (1-\lambda)Q_0 + \lambda Q_1$. Then

$$(1-\lambda)\ln Q_0(p_0 > 0) + \lambda \ln Q_1(p_1 > 0)
\le \ln\big((1-\lambda) Q_0(p_0 > 0) + \lambda Q_1(p_1 > 0)\big)
\le \ln Q_\lambda(p_0 > 0 \text{ or } p_1 > 0) = \ln Q_\lambda(p_\lambda > 0).$$

Equality holds if and only if, for the first inequality,
$Q_0(p_0 > 0) = Q_1(p_1 > 0)$ and, for the second inequality,
$p_1 > 0 \Rightarrow p_0 > 0$ ($Q_0$-a.s.) and $p_0 > 0 \Rightarrow p_1 > 0$ ($Q_1$-a.s.) These
conditions are equivalent to the equality conditions of the
theorem.

Alternatively, suppose $\alpha > 0$. We will show that pointwise

$$\begin{aligned}
(1-\lambda) p_0^{\alpha} q_0^{1-\alpha} + \lambda p_1^{\alpha} q_1^{1-\alpha} &\le p_\lambda^{\alpha} q_\lambda^{1-\alpha} && (0 < \alpha < 1);\\
(1-\lambda) p_0 \ln\frac{p_0}{q_0} + \lambda p_1 \ln\frac{p_1}{q_1} &\ge p_\lambda \ln\frac{p_\lambda}{q_\lambda} && (\alpha = 1),
\end{aligned} \tag{17}$$

where $p_\lambda = (1-\lambda)p_0 + \lambda p_1$ and $q_\lambda = (1-\lambda)q_0 + \lambda q_1$. For
$\alpha = 1$, (16) then follows directly; for $0 < \alpha < 1$, (16) follows
from (17) by Jensen's inequality:

$$(1-\lambda)\ln\int p_0^{\alpha} q_0^{1-\alpha}\,d\mu + \lambda \ln\int p_1^{\alpha} q_1^{1-\alpha}\,d\mu
\le \ln\left((1-\lambda)\int p_0^{\alpha} q_0^{1-\alpha}\,d\mu + \lambda\int p_1^{\alpha} q_1^{1-\alpha}\,d\mu\right). \tag{18}$$

If one of $p_0, p_1, q_0$ and $q_1$ is zero, then (17) can be verified
directly. So assume that they are all positive. Then for
$0 < \alpha < 1$ let $f(x) = -x^{\alpha}$ and for $\alpha = 1$ let $f(x) = x \ln x$, such
that (17) can be written as

$$\frac{(1-\lambda) q_0}{q_\lambda}\, f\!\left(\frac{p_0}{q_0}\right) + \frac{\lambda q_1}{q_\lambda}\, f\!\left(\frac{p_1}{q_1}\right) \ge f\!\left(\frac{p_\lambda}{q_\lambda}\right).$$

(17) is established by recognising this as an application of
Jensen's inequality to the strictly convex function $f$.
Regardless of whether any of $p_0, p_1, q_0$ and $q_1$ is zero, equality holds
in (17) if and only if $p_0 q_1 = p_1 q_0$. Equality holds in (18) if
and only if $\int p_0^{\alpha} q_0^{1-\alpha}\,d\mu = \int p_1^{\alpha} q_1^{1-\alpha}\,d\mu$, which is equivalent
to $D_\alpha(P_0\|Q_0) = D_\alpha(P_1\|Q_1)$. ■
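The joint convexity inequality of Theorem 11 can be spot-checked for a few orders in $(0, 1]$ and mixing weights. The two pairs of distributions, the helper `mix` and the grid of values below are assumptions chosen for illustration.

```python
import math

def renyi_divergence(p, q, alpha):
    if alpha == 1:  # Kullback-Leibler divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

def mix(a, b, lam):
    """Pointwise mixture (1 - lam) * a + lam * b of two finite distributions."""
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

p0, q0 = [0.9, 0.1], [0.5, 0.5]
p1, q1 = [0.2, 0.8], [0.6, 0.4]

def convexity_gap(alpha, lam):
    """Right-hand side minus left-hand side of the joint convexity inequality."""
    lhs = renyi_divergence(mix(p0, p1, lam), mix(q0, q1, lam), alpha)
    rhs = ((1 - lam) * renyi_divergence(p0, q0, alpha)
           + lam * renyi_divergence(p1, q1, alpha))
    return rhs - lhs

# Every gap should be nonnegative for orders in (0, 1].
gaps = [convexity_gap(a, lam) for a in (0.25, 0.5, 1) for lam in (0.1, 0.5, 0.9)]
```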
Joint convexity in $P$ and $Q$ breaks down for $\alpha > 1$ (see
Appendix B), but some partial convexity properties can still
be salvaged. First, convexity in the second argument does hold
for all a El : 
Theorem 12. For any order $\alpha \in [0, \infty]$ Rényi divergence is
convex in its second argument. That is, for any probability
distributions $P$, $Q_0$ and $Q_1$

$$D_\alpha\big(P \,\|\, (1-\lambda)Q_0 + \lambda Q_1\big) \le (1-\lambda) D_\alpha(P\|Q_0) + \lambda D_\alpha(P\|Q_1) \tag{19}$$

for any $0 < \lambda < 1$. For finite $\alpha$, equality holds if and only if

$\alpha = 0$: $D_0(P\|Q_0) = D_0(P\|Q_1)$;
$0 < \alpha < \infty$: $q_0 = q_1$ ($P$-a.s.)

Proof: For $\alpha \in [0, 1]$ this follows from the previous
theorem. (For $P_0 = P_1$ the equality conditions reduce to the
ones given here.) For $\alpha \in (1, \infty)$, let $Q_\lambda = (1-\lambda)Q_0 + \lambda Q_1$
and define $f(x, Q_\lambda) = (p(x)/q_\lambda(x))^{\alpha-1}$. It is sufficient to
show that

$$\ln E_{X \sim P}[f(X, Q_\lambda)] \le (1-\lambda)\ln E_{X \sim P}[f(X, Q_0)] + \lambda \ln E_{X \sim P}[f(X, Q_1)].$$

Noting that, for every $x \in \mathcal{X}$, $f(x, Q)$ is log-convex in $Q$, this
is a consequence of the general fact that an expectation over
log-convex functions is itself log-convex, which can be shown
using Hölder's inequality:

$$E_P[f(X, Q_\lambda)] \le E_P\big[f(X, Q_0)^{1-\lambda} f(X, Q_1)^{\lambda}\big] \le E_P[f(X, Q_0)]^{1-\lambda}\, E_P[f(X, Q_1)]^{\lambda}.$$

Taking logarithms completes the proof of (19). Equality holds
in the first inequality if and only if $q_0 = q_1$ ($P$-a.s.), which
is also sufficient for equality in the second inequality. Finally,
(19) extends to $\alpha = \infty$ by letting $\alpha$ tend to $\infty$. ■
And secondly, Rényi divergence is jointly quasi-convex in
both arguments for all $\alpha$:

Theorem 13. For any order $\alpha \in [0, \infty]$ Rényi divergence
is jointly quasi-convex in its arguments. That is, for any two
pairs of probability distributions $(P_0, Q_0)$ and $(P_1, Q_1)$, and
any $\lambda \in (0, 1)$

$$D_\alpha((1-\lambda)P_0 + \lambda P_1\|(1-\lambda)Q_0 + \lambda Q_1) \le \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\}.$$

Proof: For $\alpha \in [0, 1]$, quasi-convexity is implied by convexity.
For $\alpha \in (1, \infty)$, strict monotonicity of $x \mapsto \frac{1}{\alpha-1}\ln x$
implies that quasi-convexity is equivalent to quasi-convexity
of the Hellinger integral $\int p^\alpha q^{1-\alpha}\,d\mu$. Since quasi-convexity
is implied by ordinary convexity, it is sufficient to establish
that the Hellinger integral is jointly convex in $P$ and $Q$. Let
$p_\lambda = (1-\lambda)p_0 + \lambda p_1$ and $q_\lambda = (1-\lambda)q_0 + \lambda q_1$. Then joint
convexity of the Hellinger integral is implied by the pointwise
inequality

$$(1-\lambda)\, p_0^\alpha q_0^{1-\alpha} + \lambda\, p_1^\alpha q_1^{1-\alpha} \ge p_\lambda^\alpha q_\lambda^{1-\alpha},$$

which holds by essentially the same argument as for (17) in
the proof of Theorem 11, with the convex function $f(x) = x^\alpha$.



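Finally, the case $\alpha = \infty$ follows by letting $\alpha$ tend to $\infty$:

$$\begin{aligned}
D_\infty((1-\lambda)P_0 + \lambda P_1\|(1-\lambda)Q_0 + \lambda Q_1)
&= \sup_{\alpha<\infty} D_\alpha((1-\lambda)P_0 + \lambda P_1\|(1-\lambda)Q_0 + \lambda Q_1) \\
&\le \sup_{\alpha<\infty} \max\{D_\alpha(P_0\|Q_0), D_\alpha(P_1\|Q_1)\} \\
&= \max\Big\{\sup_{\alpha<\infty} D_\alpha(P_0\|Q_0), \sup_{\alpha<\infty} D_\alpha(P_1\|Q_1)\Big\} \\
&= \max\{D_\infty(P_0\|Q_0), D_\infty(P_1\|Q_1)\}.
\end{aligned}$$
■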



C. Continuity 

In this section we study continuity properties of the Rényi
divergence $D_\alpha(P\|Q)$ of different orders in the pair of probability
distributions $(P, Q)$. It turns out that continuity depends
on the order $\alpha$ and the topology on the set of all probability
distributions.

If the set of probability distributions on $(\mathcal{X}, \mathcal{F})$ is equipped
with the topology of setwise convergence ($\tau$-topology),
then convergence of a sequence of probability distributions
$P_1, P_2, \ldots$ to a probability distribution $Q$ means that $P_n(A) \to Q(A)$
for any $A \in \mathcal{F}$. Alternatively, one might consider the
topology defined by the total variation distance

$$V(P, Q) = \int |p - q|\,d\mu = 2 \sup_{A \in \mathcal{F}} |P(A) - Q(A)|,$$

in which $P_n \to Q$ means that $V(P_n, Q) \to 0$. The total
variation topology is stronger than the topology of setwise
convergence in the sense that convergence in total variation
distance implies convergence on any $A \in \mathcal{F}$. The two
topologies coincide if the sample space $\mathcal{X}$ is countable.
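
The two expressions for the total variation distance can be compared directly on a small finite sample space. The sketch below is illustrative only: the function names and the example vectors are assumptions, and the supremum is found by brute force over all events, which is feasible only for tiny alphabets.

```python
from itertools import combinations

def total_variation_l1(p, q):
    """V(P, Q) computed as the L1 distance between the probability vectors."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def total_variation_sup(p, q):
    """V(P, Q) computed as 2 * sup_A |P(A) - Q(A)| by enumerating every event A."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for event in combinations(range(n), size):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return 2 * best

# Hypothetical example on a four-letter alphabet.
p = [0.5, 0.3, 0.1, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
v_l1 = total_variation_l1(p, q)
v_sup = total_variation_sup(p, q)
print(v_l1, v_sup)
```

The supremum is attained by the event on which $p > q$, here the first two letters.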

In general, Renyi divergence is lower semi-continuous for 
positive orders: 

Theorem 14. For any order $\alpha \in (0, \infty]$, $D_\alpha(P\|Q)$ is a lower
semi-continuous function of the pair $(P, Q)$ in the topology of
setwise convergence.

Proof: Suppose $\mathcal{X} = \{a_1, \ldots, a_k\}$ is finite. Then for any
simple order $\alpha$

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln \sum_{i=1}^{k} p_i^\alpha q_i^{1-\alpha},$$

where $p_i = P(a_i)$ and $q_i = Q(a_i)$. If $0 < \alpha < 1$, then $p_i^\alpha q_i^{1-\alpha}$
is continuous in $(P, Q)$. For $1 < \alpha < \infty$, it is only discontinuous
at $p_i = q_i = 0$, but there $p_i^\alpha q_i^{1-\alpha} = 0 = \min_{(P,Q)} p_i^\alpha q_i^{1-\alpha}$,
so then $p_i^\alpha q_i^{1-\alpha}$ is still lower semi-continuous. These properties
carry over to $\sum_{i=1}^{k} p_i^\alpha q_i^{1-\alpha}$ and thus $D_\alpha(P\|Q)$ is
continuous for $0 < \alpha < 1$ and lower semi-continuous for
$\alpha > 1$. A supremum over (lower semi-)continuous functions
is itself lower semi-continuous. Therefore, for simple orders $\alpha$,
Theorem 2 implies that $D_\alpha(P\|Q)$ is lower semi-continuous
for arbitrary $\mathcal{X}$. This property extends to the extended orders
1 and $\infty$ by $D_\beta(P\|Q) = \sup_{\alpha<\beta} D_\alpha(P\|Q)$ for $\beta \in \{1, \infty\}$. ■

Moreover, if $\alpha \in (0, 1)$ and the stronger of the two
topologies is assumed, then Theorem 16 below shows that
Rényi divergence is continuous.

First we prove that the topologies induced by Rényi divergences
of orders $\alpha \in (0, 1)$ are all equivalent:

Theorem 15. For any $0 < \alpha \le \beta < 1$

$$\frac{\alpha}{\beta}\,\frac{1-\beta}{1-\alpha}\, D_\beta(P\|Q) \le D_\alpha(P\|Q) \le D_\beta(P\|Q).$$



This follows from the following symmetry-like property, 
which may be verified directly. 

Proposition 2 (Skew Symmetry). For any $0 < \alpha < 1$

$$D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha}\, D_{1-\alpha}(Q\|P).$$

Note that, in particular, Rényi divergence is symmetric for
$\alpha = 1/2$, but that skew symmetry does not hold for $\alpha = 0$ and
$\alpha = 1$.
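
Skew symmetry and the two-sided bound of Theorem 15 are easy to check numerically on a finite alphabet. The following is a minimal sketch under the assumption of strictly positive probability vectors; the function names and the example distributions are hypothetical.

```python
import math

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats; assumes strictly positive p, q and alpha not in {0, 1}."""
    hellinger = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(hellinger) / (alpha - 1)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

def skew_gap(alpha):
    """D_alpha(P||Q) minus alpha/(1-alpha) * D_{1-alpha}(Q||P); should vanish."""
    return renyi(p, q, alpha) - alpha / (1 - alpha) * renyi(q, p, 1 - alpha)

def theorem15_holds(alpha, beta, tol=1e-12):
    """Check the two inequalities of Theorem 15 for 0 < alpha <= beta < 1."""
    lower = (alpha / beta) * ((1 - beta) / (1 - alpha)) * renyi(p, q, beta)
    middle = renyi(p, q, alpha)
    upper = renyi(p, q, beta)
    return lower <= middle + tol and middle <= upper + tol

print([skew_gap(a) for a in (0.1, 0.5, 0.9)])
print(theorem15_holds(0.2, 0.7))
```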

Proof of Theorem 15: We have already established the
second inequality in Theorem 5, so it remains to prove the
first one. Skew symmetry implies that

$$\frac{1-\alpha}{\alpha}\, D_\alpha(P\|Q) = D_{1-\alpha}(Q\|P) \ge D_{1-\beta}(Q\|P) = \frac{1-\beta}{\beta}\, D_\beta(P\|Q),$$

from which the result follows. ■
By (O, these results show that, for a G (0, 1), 
$D_\alpha(P_n\|Q) \to 0$ is equivalent to convergence of $P_n$ to $Q$
in Hellinger distance, which is equivalent to convergence of
$P_n$ to $Q$ in total variation [25, p. 364]. Next we shall prove a
stronger result on the relation between Rényi divergence and
total variation.

Theorem 16. For $\alpha \in (0, 1)$, the Rényi divergence $D_\alpha(P\|Q)$
is a continuous function of $(P, Q)$ in the total variation
topology.

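Lemma 3. Let $0 < \alpha < 1$. Then for all $x, y \ge 0$ and $\varepsilon > 0$

$$|x^\alpha - y^\alpha| \le \varepsilon^\alpha + \varepsilon^{\alpha-1} |x - y|.$$

Proof: If $x, y \le \varepsilon$ or $x = y$ the inequality $|x^\alpha - y^\alpha| \le \varepsilon^\alpha$
is obvious. So assume that $x > y$ and $x \ge \varepsilon$. Then

$$\frac{|x^\alpha - y^\alpha|}{|x - y|} \le \frac{|x^\alpha - 0^\alpha|}{|x - 0|} = x^{\alpha-1} \le \varepsilon^{\alpha-1}.$$
■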
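
The inequality of Lemma 3 can be spot-checked on a grid of values. This is only a sanity check under assumed grid points (all names and values below are illustrative), not a substitute for the proof.

```python
def lemma3_holds(x, y, eps, alpha, tol=1e-12):
    """Check |x^a - y^a| <= eps^a + eps^(a-1) * |x - y| for one choice of arguments."""
    lhs = abs(x**alpha - y**alpha)
    rhs = eps**alpha + eps**(alpha - 1) * abs(x - y)
    return lhs <= rhs + tol

# Hypothetical grid of nonnegative points, tolerances eps and orders alpha in (0, 1).
grid = [0.0, 0.01, 0.1, 0.5, 1.0, 3.0, 10.0]
violations = [
    (x, y, eps, alpha)
    for alpha in (0.1, 0.5, 0.9)
    for eps in (0.01, 0.1, 1.0, 5.0)
    for x in grid
    for y in grid
    if not lemma3_holds(x, y, eps, alpha)
]
print(violations)
```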



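Proof of Theorem 16: First note that Rényi divergence
is a function of the power divergence
$d_\alpha(P, Q) = \int \big(1 - (dP/dQ)^\alpha\big)\,dQ$:

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1} \ln\big(1 - d_\alpha(P, Q)\big).$$

Since $x \mapsto \frac{1}{\alpha-1}\ln(1 - x)$ is continuous, it is sufficient to
prove that $d_\alpha(P, Q)$ is a continuous function of $(P, Q)$. For
any $\varepsilon > 0$ and distributions $P_1$, $P_2$ and $Q$, Lemma 3 implies

$$\begin{aligned}
|d_\alpha(P_1, Q) - d_\alpha(P_2, Q)|
&\le \int \left| \Big(\frac{dP_1}{dQ}\Big)^{\alpha} - \Big(\frac{dP_2}{dQ}\Big)^{\alpha} \right| dQ \\
&\le \int \left( \varepsilon^\alpha + \varepsilon^{\alpha-1} \left| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \right| \right) dQ \\
&= \varepsilon^\alpha + \varepsilon^{\alpha-1} \int \left| \frac{dP_1}{dQ} - \frac{dP_2}{dQ} \right| dQ \\
&\le \varepsilon^\alpha + \varepsilon^{\alpha-1}\, V(P_1, P_2).
\end{aligned}$$

As $d_\alpha(P, Q) = d_{1-\alpha}(Q, P)$, it also follows that
$|d_\alpha(P, Q_1) - d_\alpha(P, Q_2)| \le \varepsilon^{1-\alpha} + \varepsilon^{-\alpha}\, V(Q_1, Q_2)$ for any
$Q_1$, $Q_2$ and $P$. Therefore

$$\begin{aligned}
|d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_2)|
&\le |d_\alpha(P_1, Q_1) - d_\alpha(P_2, Q_1)| + |d_\alpha(P_2, Q_1) - d_\alpha(P_2, Q_2)| \\
&\le \varepsilon^\alpha + \varepsilon^{\alpha-1}\, V(P_1, P_2) + \varepsilon^{1-\alpha} + \varepsilon^{-\alpha}\, V(Q_1, Q_2),
\end{aligned}$$

from which the theorem follows. ■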
A partial extension to $\alpha = 0$ follows:

Corollary 1. The Rényi divergence $D_0(P\|Q)$ is an upper
semi-continuous function of $(P, Q)$ in the total variation
topology.

Proof: This follows from Theorem 16 because $D_0(P\|Q)$
is the infimum of the continuous functions $(P, Q) \mapsto D_\alpha(P\|Q)$
for $\alpha \in (0, 1)$. ■

Finally, if we consider continuity in $Q$ only, we obtain:

Theorem 17. Suppose $\mathcal{X}$ is finite, and let $\alpha \in [0, \infty]$. Then
for any $P$ the Rényi divergence $D_\alpha(P\|Q)$ is continuous in $Q$.

Proof: Directly from the closed-form expressions for
Rényi divergence. ■

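D. Limits of $\sigma$-Algebras

As shown by Theorem 2 there exists a sequence of finite
partitions $\mathcal{P}_1, \mathcal{P}_2, \ldots$ such that

$$D_\alpha(P_{|\mathcal{P}_n}\|Q_{|\mathcal{P}_n}) \uparrow D_\alpha(P\|Q). \tag{21}$$

Theorem 18 below elaborates on this result. It implies that (21)
holds for any increasing sequence of partitions $\mathcal{P}_1 \subseteq \mathcal{P}_2 \subseteq \cdots$
that generate $\sigma$-algebras converging to $\mathcal{F}$, in the sense that
$\mathcal{F} = \sigma\big(\bigcup_{n=1}^{\infty} \mathcal{P}_n\big)$. A corresponding result holds for infinite
sequences of increasingly coarse partitions, which is shown
by Theorem 19.

Theorem 18 (Increasing). Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be a
nondecreasing family of $\sigma$-algebras, and let $\mathcal{F}_\infty = \sigma\big(\bigcup_{n=1}^{\infty} \mathcal{F}_n\big)$
be the smallest $\sigma$-algebra containing them. Then for any order
$\alpha \in (0, \infty]$

$$\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}). \tag{22}$$

For $\alpha = 0$, (22) does not hold. A counterexample is given
after Example 2 below.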

Lemma 4. Let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$ be a nondecreasing family
of $\sigma$-algebras, and let $P$ and $\mu$ be probability distributions on
$(\mathcal{X}, \mathcal{F})$ such that $P \ll \mu$. Let $p$ be the density of $P$ with
respect to $\mu$. Then the family of random variables $\{X_n\}_{n\ge 1}$
with members $X_n = E[p \mid \mathcal{F}_n]$ is uniformly integrable (with
respect to $\mu$).

The proof of this lemma is a special case of part of the proof
of Lévy's theorem in Shiryaev's textbook [25]. We repeat it
here for completeness.

Proof: For any constants $b, c > 0$

$$\begin{aligned}
\int_{X_n > b} X_n\,d\mu = \int_{X_n > b} p\,d\mu
&\le \int_{X_n > b,\, p \le c} p\,d\mu + \int_{X_n > b,\, p > c} p\,d\mu \\
&\le c \cdot \mu(X_n > b) + \int_{p > c} p\,d\mu \\
&\overset{(*)}{\le} \frac{c}{b}\, E[X_n] + \int_{p > c} p\,d\mu = \frac{c}{b} + \int_{p > c} p\,d\mu,
\end{aligned}$$

in which the inequality marked by (*) is Markov's. Consequently

$$\begin{aligned}
\lim_{b\to\infty} \sup_n \int_{X_n > b} |X_n|\,d\mu
&= \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{X_n > b} |X_n|\,d\mu \\
&\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b} + \lim_{c\to\infty} \int_{p > c} p\,d\mu = 0,
\end{aligned}$$

which proves the lemma. ■
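Proof of Theorem 18: As by the data processing inequality
$D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \le D_\alpha(P\|Q)$ for all $n$, we only need
to show that $\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \ge D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty})$.
To this end, assume without loss of generality that $\mathcal{F} = \mathcal{F}_\infty$
and that $\mu$ is a probability distribution (i.e. $\mu = (P + Q)/2$).
Let $X_n = E[p \mid \mathcal{F}_n]$ and $Y_n = E[q \mid \mathcal{F}_n]$, and define the
distributions $P_n$ and $Q_n$ on $(\mathcal{X}, \mathcal{F})$ by

$$P_n(A) = \int_A X_n\,d\mu, \qquad Q_n(A) = \int_A Y_n\,d\mu \qquad (A \in \mathcal{F}),$$

such that, by the Radon-Nikodým theorem and Proposition 1,

$$\frac{dP_n}{d\mu} = X_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}} \quad\text{and}\quad \frac{dQ_n}{d\mu} = Y_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}} \qquad (\mu\text{-a.s.})$$

It follows that

$$D_\alpha(P_n\|Q_n) = D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n})$$

for $0 < \alpha < \infty$ and therefore by continuity also for $\alpha = \infty$.
We will proceed to show that $(P_n, Q_n) \to (P, Q)$ in the
topology of setwise convergence. By lower semi-continuity
of Rényi divergence this implies that $\lim_{n\to\infty} D_\alpha(P_n\|Q_n) \ge D_\alpha(P\|Q)$,
from which the theorem follows. By Lévy's theorem [25],
$\lim_{n\to\infty} X_n = p$ ($\mu$-a.s.) Hence uniform integrability
of the family $\{X_n\}$ (by Lemma 4) implies that for any
$A \in \mathcal{F}$

$$\lim_{n\to\infty} P_n(A) = \lim_{n\to\infty} \int_A X_n\,d\mu = \int_A p\,d\mu = P(A)$$

[25, Thm. 5, p. 189]. Similarly $\lim_{n\to\infty} Q_n(A) = Q(A)$, so
we find that $(P_n, Q_n) \to (P, Q)$, which completes the proof. ■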

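Theorem 19 (Decreasing). Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$
be a nonincreasing family of $\sigma$-algebras, and let
$\mathcal{F}_\infty = \bigcap_{n=1}^{\infty} \mathcal{F}_n$ be the largest $\sigma$-algebra contained in all of them.
Let $\alpha \in [0, \infty)$. If $\alpha \in [0, 1)$ or there exists an $m$ such that
$D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$, then

$$\lim_{n\to\infty} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).$$

The theorem cannot be extended to the case $\alpha = \infty$.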

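Lemma 5. Let $\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a nonincreasing
family of $\sigma$-algebras. Let $\alpha \in (0, \infty)$, $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$,
$q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$ and $X_n = f(p_n/q_n)$, where $f(x) = x^\alpha$ if $\alpha \ne 1$ and
$f(x) = x \ln x + e^{-1}$ if $\alpha = 1$. If $\alpha \in (0, 1)$, or $E_Q[X_1] < \infty$
and $P \ll Q$, then the family $\{X_n\}_{n \ge 1}$ is uniformly integrable
(with respect to $Q$).

Proof: Suppose first that $\alpha \in (0, 1)$. Then for any $b > 0$

$$\int_{X_n > b} X_n\,dQ \le \int_{X_n > b} X_n \Big(\frac{X_n}{b}\Big)^{(1-\alpha)/\alpha} dQ \le b^{-(1-\alpha)/\alpha} \int \frac{p_n}{q_n}\,dQ \le b^{-(1-\alpha)/\alpha},$$

and, as $X_n \ge 0$, $\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ = 0$, which
was to be shown. Alternatively, suppose that $\alpha \in [1, \infty)$
and assume without loss of generality that $\mathcal{F} = \mathcal{F}_1$. Then
$\frac{p_n}{q_n} = \frac{dP_{|\mathcal{F}_n}}{dQ_{|\mathcal{F}_n}}$ ($Q$-a.s.) and hence by Proposition 1 and Jensen's
inequality for conditional expectations

$$X_n = f\Big(E\Big[\frac{dP}{dQ} \,\Big|\, \mathcal{F}_n\Big]\Big) \le E\Big[f\Big(\frac{dP}{dQ}\Big) \,\Big|\, \mathcal{F}_n\Big] = E[X_1 \mid \mathcal{F}_n]$$

($Q$-a.s.) As $\min_x x \ln x = -e^{-1}$, it follows that $X_n \ge 0$ and
for any $b, c > 0$

$$\begin{aligned}
\int_{|X_n| > b} |X_n|\,dQ = \int_{X_n > b} X_n\,dQ
&\le \int_{X_n > b} E[X_1 \mid \mathcal{F}_n]\,dQ = \int_{X_n > b} X_1\,dQ \\
&= \int_{X_n > b,\, X_1 \le c} X_1\,dQ + \int_{X_n > b,\, X_1 > c} X_1\,dQ \\
&\le c \cdot Q(X_n > b) + \int_{X_1 > c} X_1\,dQ \\
&\le \frac{c}{b}\, E_Q[X_n] + \int_{X_1 > c} X_1\,dQ \\
&\le \frac{c}{b}\, E_Q[X_1] + \int_{X_1 > c} X_1\,dQ,
\end{aligned}$$

where $E_Q[X_n] \le E_Q[X_1]$ in the last inequality follows from
the data processing inequality. Consequently,

$$\begin{aligned}
\lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ
&= \lim_{c\to\infty} \lim_{b\to\infty} \sup_n \int_{|X_n| > b} |X_n|\,dQ \\
&\le \lim_{c\to\infty} \lim_{b\to\infty} \frac{c}{b}\, E_Q[X_1] + \lim_{c\to\infty} \int_{X_1 > c} X_1\,dQ = 0,
\end{aligned}$$

and the lemma follows. ■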
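Proof of Theorem 19: First suppose that $\alpha > 0$ and, for
$n = 1, 2, \ldots, \infty$, let $p_n = \frac{dP_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$, $q_n = \frac{dQ_{|\mathcal{F}_n}}{d\mu_{|\mathcal{F}_n}}$
and $X_n = f(p_n/q_n)$ with $f(x) = x^\alpha$ if $\alpha \ne 1$ and $f(x) = x \ln x + e^{-1}$
if $\alpha = 1$, as in Lemma 5. If $\alpha \ge 1$, then assume without
loss of generality that $\mathcal{F} = \mathcal{F}_1$ and $m = 1$, such that
$D_\alpha(P_{|\mathcal{F}_m}\|Q_{|\mathcal{F}_m}) < \infty$ implies $P \ll Q$. Now, for any $\alpha > 0$,
it is sufficient to show that

$$E_Q[X_n] \to E_Q[X_\infty]. \tag{23}$$

By Proposition 1, $p_n = E_\mu[p \mid \mathcal{F}_n]$ and $q_n = E_\mu[q \mid \mathcal{F}_n]$.
Therefore by a version of Lévy's theorem for decreasing
sequences of $\sigma$-algebras [28, Theorem 6.23],

$$\begin{aligned}
p_n &= E_\mu[p \mid \mathcal{F}_n] \to E_\mu[p \mid \mathcal{F}_\infty] = p_\infty, \\
q_n &= E_\mu[q \mid \mathcal{F}_n] \to E_\mu[q \mid \mathcal{F}_\infty] = q_\infty,
\end{aligned} \qquad (\mu\text{-a.s.})$$

and hence $X_n \to X_\infty$ ($\mu$-a.s. and therefore $Q$-a.s.) If $0 < \alpha < 1$, then

$$E_Q[X_n] = E_\mu[p_n^\alpha q_n^{1-\alpha}] \le E_\mu[\alpha p_n + (1-\alpha) q_n] = 1 < \infty.$$

And if $\alpha \ge 1$, then by the data processing inequality
$D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) < \infty$ for all $n$, which implies that also in this
case $E_Q[X_n] < \infty$. Hence uniform integrability (by Lemma 5)
of the family of nonnegative random variables $\{X_n\}$ implies
(23) [25, Thm. 5, p. 189], and the theorem follows for $\alpha > 0$.
The remaining case, $\alpha = 0$, is proved by

$$\begin{aligned}
\lim_{n\to\infty} D_0(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n})
&= \inf_n \inf_{\alpha > 0} D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) = \inf_{\alpha > 0} \inf_n D_\alpha(P_{|\mathcal{F}_n}\|Q_{|\mathcal{F}_n}) \\
&= \inf_{\alpha > 0} D_\alpha(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}) = D_0(P_{|\mathcal{F}_\infty}\|Q_{|\mathcal{F}_\infty}).
\end{aligned}$$
■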

E. Distributions on Sequences 

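Suppose $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ is the direct product of an infinite
sequence of measurable spaces $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ That
is, $\mathcal{X}^\infty = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots$ and $\mathcal{F}^\infty$ is the smallest $\sigma$-algebra
containing all the cylinder sets

$$S_n(A) = \{x^\infty \in \mathcal{X}^\infty \mid x_1, \ldots, x_n \in A\}, \qquad A \in \mathcal{F}^n,$$

for $n = 1, 2, \ldots$, where $\mathcal{F}^n = \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$. Then a
sequence of probability distributions $P^1, P^2, \ldots$, where $P^n$
is a distribution on $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, is called consistent
if

$$P^{n+1}(A \times \mathcal{X}_{n+1}) = P^n(A), \qquad A \in \mathcal{F}^n.$$

For any such consistent sequence there exists a distribution
$P^\infty$ on $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ such that its marginal distribution on $\mathcal{X}^n$
is $P^n$, in the sense that

$$P^\infty(S_n(A)) = P^n(A), \qquad A \in \mathcal{F}^n.$$

If $P^1, P^2, \ldots$ and $Q^1, Q^2, \ldots$ are two consistent sequences of
probability distributions, then it is natural to ask whether the
Rényi divergence $D_\alpha(P^n\|Q^n)$ converges to $D_\alpha(P^\infty\|Q^\infty)$.
The following theorem shows that it does for $\alpha > 0$.

Theorem 20 (Consistent Distributions). Let $P^1, P^2, \ldots$ and
$Q^1, Q^2, \ldots$ be consistent sequences of probability distributions
on $(\mathcal{X}^1, \mathcal{F}^1), (\mathcal{X}^2, \mathcal{F}^2), \ldots$, where, for $n = 1, \ldots, \infty$,
$(\mathcal{X}^n, \mathcal{F}^n)$ is the direct product of the first $n$ measurable spaces
in the infinite sequence $(\mathcal{X}_1, \mathcal{F}_1), (\mathcal{X}_2, \mathcal{F}_2), \ldots$ Then for any
$\alpha \in (0, \infty]$

$$D_\alpha(P^n\|Q^n) \to D_\alpha(P^\infty\|Q^\infty)$$

as $n \to \infty$.

Proof: Let $\mathcal{G}^n = \{S_n(A) \mid A \in \mathcal{F}^n\}$. Then

$$D_\alpha(P^n\|Q^n) = D_\alpha(P^\infty_{|\mathcal{G}^n}\|Q^\infty_{|\mathcal{G}^n}) \to D_\alpha(P^\infty\|Q^\infty)$$

by Theorem 18. ■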
As a special case, we find that finite additivity of Rényi
divergence, which is easy to verify, extends to countable
additivity:

Theorem 21 (Additivity). For $n = 1, 2, \ldots$, let $(P_n, Q_n)$
be pairs of probability distributions on measurable spaces
$(\mathcal{X}_n, \mathcal{F}_n)$. Then for any $\alpha \in [0, \infty]$ and any $N \in \{1, 2, \ldots\}$

$$\sum_{n=1}^{N} D_\alpha(P_n\|Q_n) = D_\alpha(P_1 \times \cdots \times P_N\|Q_1 \times \cdots \times Q_N), \tag{24}$$

and, except for $\alpha = 0$, also

$$\sum_{n=1}^{\infty} D_\alpha(P_n\|Q_n) = D_\alpha(P_1 \times P_2 \times \cdots\|Q_1 \times Q_2 \times \cdots). \tag{25}$$

Countable additivity as in (25) does not hold for $\alpha = 0$. A
counterexample is given following Example 2 below.

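Proof: For simple orders $\alpha$, (24) follows from independence
of $P_n$ and $Q_n$ between different $n$, which implies that
the Hellinger integral of the product distributions factorises:

$$\int \prod_{n=1}^{N} p_n^\alpha q_n^{1-\alpha}\, d(\mu_1 \times \cdots \times \mu_N) = \prod_{n=1}^{N} \int p_n^\alpha q_n^{1-\alpha}\,d\mu_n,$$

where $p_n$ and $q_n$ denote densities of $P_n$ and $Q_n$ with respect to a dominating measure $\mu_n$.
As $N$ is finite, this extends to the extended orders by continuity
in $\alpha$. Finally, (25) follows from Theorem 20 by observing that
the sequences $P^N = P_1 \times \cdots \times P_N$ and $Q^N = Q_1 \times \cdots \times Q_N$,
for $N = 1, 2, \ldots$, are consistent. ■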
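
Finite additivity (24) can be verified numerically by building the product distributions explicitly. The sketch below assumes strictly positive finite distributions and simple orders; the function names and the factor distributions are made up for illustration.

```python
import math
from itertools import product

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats; assumes strictly positive p, q and alpha not in {0, 1}."""
    hellinger = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(hellinger) / (alpha - 1)

def product_distribution(factors):
    """Probability vector of the independent product of the given factor distributions."""
    return [math.prod(probs) for probs in product(*factors)]

# Hypothetical pairs (P_n, Q_n) on alphabets of sizes 2, 2 and 3.
pairs = [
    ([0.5, 0.5], [0.9, 0.1]),
    ([0.2, 0.8], [0.3, 0.7]),
    ([0.6, 0.3, 0.1], [0.2, 0.2, 0.6]),
]

def additivity_gap(alpha):
    """Sum of the per-factor divergences minus the divergence of the products."""
    total = sum(renyi(p, q, alpha) for p, q in pairs)
    joint_p = product_distribution([p for p, _ in pairs])
    joint_q = product_distribution([q for _, q in pairs])
    return total - renyi(joint_p, joint_q, alpha)

print([additivity_gap(a) for a in (0.5, 2.0, 3.0)])
```

Both product vectors are enumerated in the same order, so their entries line up outcome by outcome.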

F. Absolute Continuity and Mutual Singularity 

Shiryaev [25, pp. 366, 370] relates Hellinger integrals to
absolute continuity and mutual singularity of probability distributions.
His results may more elegantly be expressed in terms
of Rényi divergence. They then follow from the observations
that $D_0(P\|Q) = 0$ if and only if $Q$ is absolutely continuous
with respect to $P$ and that $D_0(P\|Q) = \infty$ if and only if $P$
and $Q$ are mutually singular, together with right-continuity of
$D_\alpha(P\|Q)$ in $\alpha$ at $\alpha = 0$.

Theorem 22 ([25, Theorem 2, p. 366]). The following conditions
are equivalent:

(i) $Q \ll P$,
(ii) $Q(p > 0) = 1$,
(iii) $D_0(P\|Q) = 0$,
(iv) $\lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = 0$.

Proof: Clearly (ii) is equivalent to $Q(p = 0) = 0$,
which is equivalent to (i). The other cases follow by
$\lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = D_0(P\|Q) = -\ln Q(p > 0)$. ■

Theorem 23 ([25, Theorem 3, p. 366]). The following conditions
are equivalent:

(i) $P \perp Q$,
(ii) $Q(p > 0) = 0$,
(iii) $D_\alpha(P\|Q) = \infty$ for some $\alpha \in [0, 1)$,
(iv) $D_\alpha(P\|Q) = \infty$ for all $\alpha \in [0, \infty]$.

Proof: Equivalence of (i), (ii) and $D_0(P\|Q) = \infty$ follows
from definitions. Equivalence of $D_0(P\|Q) = \infty$ and (iv)
follows from the fact that Rényi divergence is continuous on
$[0, 1]$ and nondecreasing in $\alpha$. Finally, (iii) for some $\alpha \in (0, 1)$
is equivalent to

$$\int p^\alpha q^{1-\alpha}\,d\mu = 0,$$

which holds if and only if $pq = 0$ ($\mu$-a.s.). It follows that in
this case (iii) is equivalent to (i). ■

These properties give a convenient mathematical tool to
prove absolute continuity or mutual singularity of infinite
product distributions, as illustrated by the following proof by
Shiryaev [25] of the Gaussian dichotomy [29]-[31].

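Example 2 (Gaussian Dichotomy). Let $P = P_1 \times P_2 \times \cdots$
and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and $Q_n$ are Gaussian
distributions with densities

$$p_n(x) = \frac{1}{\sqrt{\tau}}\, e^{-\frac{1}{2}(x - \mu_n)^2}, \qquad q_n(x) = \frac{1}{\sqrt{\tau}}\, e^{-\frac{1}{2}(x - \nu_n)^2},$$

where $\tau = 2\pi$. Then

$$D_\alpha(P_n\|Q_n) = \frac{1}{2}\,\alpha\,(\mu_n - \nu_n)^2,$$

and by additivity for $\alpha > 0$

$$D_\alpha(P\|Q) = \frac{1}{2}\,\alpha \sum_{n=1}^{\infty} (\mu_n - \nu_n)^2.$$

Consequently, by Theorems 22 and 23 and symmetry in $P$
and $Q$:

$$\begin{aligned}
Q \ll P \iff P \ll Q &\iff \sum_{n=1}^{\infty} (\mu_n - \nu_n)^2 < \infty, \\
Q \perp P &\iff \sum_{n=1}^{\infty} (\mu_n - \nu_n)^2 = \infty.
\end{aligned}$$

The observation that $P$ and $Q$ are either equivalent (both $P \ll Q$
and $Q \ll P$) or mutually singular is called the Gaussian
dichotomy.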

Example 2 shows that countable additivity does not hold for
$\alpha = 0$: if $\sum_{n=1}^{\infty} (\mu_n - \nu_n)^2 = \infty$, then $\sum_{n=1}^{N} D_0(P_n\|Q_n) = 0$
for all $N$, while $D_0(P\|Q) = \infty$. In light of the proof of
Theorem 21 this also provides a counterexample to (22) for
$\alpha = 0$.
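
The closed form $D_\alpha(P_n\|Q_n) = \frac{1}{2}\alpha(\mu_n - \nu_n)^2$ for unit-variance Gaussians can be compared against a direct numerical evaluation of the Hellinger integral. The following sketch uses a plain midpoint rule; the integration range, step count and function names are assumptions chosen for illustration.

```python
import math

def gaussian_density(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def renyi_gaussian_numeric(mu, nu, alpha, lo=-12.0, hi=12.0, steps=24000):
    """Approximate D_alpha via a midpoint rule for the integral of p^alpha * q^(1-alpha)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += gaussian_density(x, mu) ** alpha * gaussian_density(x, nu) ** (1 - alpha)
    return math.log(total * h) / (alpha - 1)

def renyi_gaussian_closed(mu, nu, alpha):
    """Closed form from Example 2 for unit-variance Gaussians."""
    return 0.5 * alpha * (mu - nu) ** 2

for alpha in (0.5, 2.0):
    print(alpha, renyi_gaussian_numeric(0.0, 1.0, alpha), renyi_gaussian_closed(0.0, 1.0, alpha))
```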

The Gaussian dichotomy raises the question of whether the
same dichotomy holds for other product distributions. Let $P \sim Q$
denote that $P$ and $Q$ are equivalent (both $P \ll Q$ and
$Q \ll P$). Suppose that $P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$,
where $P_n$ and $Q_n$ are arbitrary distributions on
arbitrary measurable spaces. Then if $P_n \not\sim Q_n$ for some $n$,
$P$ and $Q$ are not equivalent either. The question is therefore
answered by the following theorem:

Theorem 24 (Kakutani's Dichotomy). Let $\alpha \in (0, 1)$ and let
$P = P_1 \times P_2 \times \cdots$ and $Q = Q_1 \times Q_2 \times \cdots$, where $P_n$ and
$Q_n$ are distributions on arbitrary measurable spaces such that
$P_n \sim Q_n$. Then

$$\begin{aligned}
Q \sim P &\iff \sum_{n=1}^{\infty} D_\alpha(P_n\|Q_n) < \infty, \\
Q \perp P &\iff \sum_{n=1}^{\infty} D_\alpha(P_n\|Q_n) = \infty.
\end{aligned}$$

Proof: If $\sum_{n=1}^{\infty} D_\alpha(P_n\|Q_n) = \infty$, then $D_\alpha(P\|Q) = \infty$
and $Q \perp P$ follows by Theorem 23. On the other hand, if
$\sum_{n=1}^{\infty} D_\alpha(P_n\|Q_n) < \infty$, then for every $\varepsilon > 0$ there exists an
$N$ such that

$$\sum_{n=N+1}^{\infty} D_\alpha(P_n\|Q_n) \le \varepsilon,$$

and consequently by additivity and monotonicity in $\alpha$:

$$D_0(P\|Q) = \lim_{\alpha \downarrow 0} D_\alpha(P\|Q) \le \lim_{\alpha \downarrow 0} D_\alpha(P_1 \times \cdots \times P_N\|Q_1 \times \cdots \times Q_N) + \varepsilon = \varepsilon.$$

As this holds for any $\varepsilon > 0$, $D_0(P\|Q)$ must equal 0, and, by
Theorem 22, $Q \ll P$. As $Q \ll P$ implies $Q \not\perp P$, Theorem 23
implies that $D_\alpha(Q\|P) < \infty$, and by repeating the argument
with the roles of $P$ and $Q$ reversed we find that also $P \ll Q$,
which completes the proof. ■

Theorem 24 (with $\alpha = 1/2$) is equivalent to a classical result
by Kakutani [32], which was stated in terms of Hellinger
integrals rather than Rényi divergence, and according to Gibbs
and Su [22] might be responsible for popularising Hellinger
integrals. As shown by Rényi [33], Kakutani's result is related
to the amount of information that a sequence of observations
contains about the parameter of a statistical model.

Contiguity and entire separation are asymptotic versions of
absolute continuity and mutual singularity [34]. As might be
expected, analogues of Theorems 22 and 23 also hold for these
asymptotic concepts.

Let $(\mathcal{X}_n, \mathcal{F}_n)_{n=1,2,\ldots}$ be a sequence of measurable spaces,
and let $(P_n)_{n=1,2,\ldots}$ and $(Q_n)_{n=1,2,\ldots}$ be sequences of distributions
on these spaces. Then the sequence $(P_n)$ is contiguous
with respect to the sequence $(Q_n)$, denoted $(P_n) \lhd (Q_n)$,
if for all sequences of events $(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ such that
$Q_n(A_n) \to 0$ as $n \to \infty$, we also have $P_n(A_n) \to 0$. If
both $(P_n) \lhd (Q_n)$ and $(Q_n) \lhd (P_n)$, then the sequences
are called mutually contiguous and we write $(P_n) \lhd\rhd (Q_n)$.
The sequences $(P_n)$ and $(Q_n)$ are entirely separated, denoted
$(P_n) \bigtriangleup (Q_n)$, if there exist a sequence of events
$(A_n \in \mathcal{F}_n)_{n=1,2,\ldots}$ and a subsequence $(n_k)_{k=1,2,\ldots}$ such that
$P_{n_k}(A_{n_k}) \to 0$ and $Q_{n_k}(\mathcal{X}_{n_k} \setminus A_{n_k}) \to 0$ as $k \to \infty$.

Contiguity and entire separation are related to absolute
continuity and mutual singularity in the following way [25,
p. 369]: if $\mathcal{X}_n = \mathcal{X}$, $P_n = P$ and $Q_n = Q$ for all $n$, then

$$\begin{aligned}
(P_n) \lhd (Q_n) &\iff P \ll Q, \\
(P_n) \lhd\rhd (Q_n) &\iff P \sim Q, \\
(P_n) \bigtriangleup (Q_n) &\iff P \perp Q.
\end{aligned}$$

Theorems 1 and 2 by Shiryaev [25, p. 370] imply the following
two asymptotic analogues of Theorems 22 and 23:

Theorem 25. The following conditions are equivalent:

(i) $(Q_n) \lhd (P_n)$,
(ii) $\lim_{\alpha \downarrow 0} \limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = 0$.

Theorem 26. The following conditions are equivalent:

(i) $(P_n) \bigtriangleup (Q_n)$,
(ii) $\lim_{\alpha \downarrow 0} \limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$,
(iii) $\limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$ for some $\alpha \in (0, 1)$,
(iv) $\limsup_{n\to\infty} D_\alpha(P_n\|Q_n) = \infty$ for all $\alpha \in (0, \infty]$.

G. Taylor Approximation for Parametric Models 

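Suppose $\{P_\theta \mid \theta \in \Theta \subseteq \mathbb{R}\}$ is a parametric statistical
model. Then it is well known that, for sufficiently regular
parametrisations, a second order Taylor approximation of
$D(P_\theta\|P_{\theta'})$ in $\theta'$ at $\theta$ in the interior of $\Theta$ yields

$$\lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2}\, D(P_\theta\|P_{\theta'}) = \frac{1}{2}\, J(\theta),$$

where $J(\theta) = E\big[\big(\tfrac{d}{d\theta} \ln p_\theta\big)^2\big]$ denotes the Fisher information
at $\theta$ (see e.g. [27, Problem 12.7]). Haussler and Opper [6]
argue that this property generalises to

$$\lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2}\, D_\alpha(P_\theta\|P_{\theta'}) = \frac{\alpha}{2}\, J(\theta)$$

for any $\alpha \in (0, \infty)$.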
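
For a concrete feel for this quadratic approximation, one can take the Bernoulli model, for which the Fisher information is $J(\theta) = 1/(\theta(1-\theta))$, and watch the ratio $D_\alpha(P_\theta\|P_{\theta'})/(\theta-\theta')^2$ approach $\alpha J(\theta)/2$. The choice of model, the step sizes and the function names below are assumptions for illustration.

```python
import math

def renyi_bernoulli(theta, theta2, alpha):
    """D_alpha between Bernoulli(theta) and Bernoulli(theta2), in nats, alpha not in {0, 1}."""
    hellinger = (theta**alpha * theta2**(1 - alpha)
                 + (1 - theta)**alpha * (1 - theta2)**(1 - alpha))
    return math.log(hellinger) / (alpha - 1)

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli model at theta."""
    return 1.0 / (theta * (1.0 - theta))

theta, alpha = 0.3, 2.0
# Ratio D_alpha / (theta - theta')^2 for shrinking perturbations theta' = theta + d.
ratios = [renyi_bernoulli(theta, theta + d, alpha) / d**2 for d in (1e-1, 1e-2, 1e-3)]
limit = 0.5 * alpha * fisher_bernoulli(theta)
print(ratios, limit)
```

The ratios should move towards the limiting value as the perturbation shrinks, with the remaining discrepancy attributable to higher-order Taylor terms.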

IV. MINIMAX RESULTS

A. Hypothesis Testing and Chernoff Information 

Rényi divergence appears in bounds on the error probabilities
when testing a probabilistic hypothesis $Q$ against an
alternative $P$ [4], [35], [36]. This can be explained by the
fact that (1 — a)D a (P\\Q) equals the cumulant generating 
function for the random variable ln(p/q) under the distribution 
Q (provided a G (0, 1) or P < Q) g). The following 
theorem relates this cumulant generating function to two
Kullback-Leibler divergences that involve the distribution $P_\alpha$
with density

$$p_\alpha = \frac{q^{1-\alpha} p^\alpha}{\int q^{1-\alpha} p^\alpha\,d\mu}, \tag{26}$$

which is well defined if and only if $0 < \int p^\alpha q^{1-\alpha}\,d\mu < \infty$.

Theorem 27. For any simple order $\alpha$

$$(1 - \alpha) D_\alpha(P\|Q) = \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \tag{27}$$

with the convention that $\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = \infty$ if it
would otherwise be undefined. Moreover, if the distribution $P_\alpha$
with density (26) is well defined and $\alpha \in (0, 1)$ or $D(P_\alpha\|P) < \infty$,
then the infimum is uniquely achieved by $R = P_\alpha$.

This result gives an interpretation of Renyi divergence as a 
trade-off between two Kullback-Leibler divergences. 
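
On a finite alphabet the trade-off identity can be checked directly: the objective evaluated at the tilted distribution $P_\alpha$ should match $(1-\alpha)D_\alpha(P\|Q)$, and other choices of $R$ should do no better. The sketch below assumes strictly positive distributions and $\alpha \in (0,1)$; all names and numbers are illustrative.

```python
import math

def kl(r, p):
    """Kullback-Leibler divergence D(R||P) in nats for strictly positive vectors."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p))

def renyi(p, q, alpha):
    """D_alpha(P||Q) in nats; assumes strictly positive p, q and alpha not in {0, 1}."""
    hellinger = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(hellinger) / (alpha - 1)

def tilted(p, q, alpha):
    """Distribution P_alpha with density proportional to p^alpha * q^(1-alpha), as in (26)."""
    weights = [pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(r, p, q, alpha):
    """The quantity alpha * D(R||P) + (1 - alpha) * D(R||Q) minimised in (27)."""
    return alpha * kl(r, p) + (1 - alpha) * kl(r, q)

# Hypothetical example.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
alpha = 0.4
target = (1 - alpha) * renyi(p, q, alpha)
value_at_tilted = objective(tilted(p, q, alpha), p, q, alpha)
others = [objective(r, p, q, alpha) for r in (p, q, [1 / 3, 1 / 3, 1 / 3])]
print(target, value_at_tilted, others)
```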

Remark 1. Theorem 27 was formulated and proved for distributions
on finite sets by Shayevitz [17], but appeared in the
above formulation already in [16].



Proof of Theorem 27: First suppose that $P_\alpha$ is well
defined or, equivalently, that $D_\alpha(P\|Q) < \infty$. Then for
$\alpha \in (0, 1)$ or $D(R\|P) < \infty$, we have

$$\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = D(R\|P_\alpha) - \ln \int p^\alpha q^{1-\alpha}\,d\mu.$$

Hence, if $0 < \alpha < 1$ or $D(P_\alpha\|P) < \infty$, the infimum over
$R$ is uniquely achieved by $R = P_\alpha$, for which it equals
$(1 - \alpha) D_\alpha(P\|Q)$ as required. If, on the other hand, $\alpha > 1$ and
$D(P_\alpha\|P) = \infty$, then we still have

$$\inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} \ge (1 - \alpha) D_\alpha(P\|Q). \tag{28}$$

Secondly, suppose $\alpha \in (0, 1)$ and $D_\alpha(P\|Q) = \infty$. Then
$P \perp Q$, and consequently either $D(R\|P) = \infty$ or $D(R\|Q) = \infty$
for all $R$, which means that (27) holds.

Next, consider the case that $\alpha > 1$ and $P \not\ll Q$. Then
$D_\alpha(P\|Q) = \infty$ and the infimum over $R$ is achieved by $R = P$,
for which it equals $-\infty$, and again (27) holds.

Finally, we prove (27) for the remaining cases: $\alpha > 1$, $P \ll Q$
and either: (1) $D_\alpha(P\|Q) < \infty$, but $D(P_\alpha\|P) = \infty$; or (2)
$D_\alpha(P\|Q) = \infty$. To this end, let $P_c = P(\cdot \mid p \le cq)$ for all
$c$ that are sufficiently large that $P(p \le cq) > 0$. The reader
may verify that $D_\alpha(P_c\|Q) < \infty$ and $D(S\|P_c) < \infty$ for
$s = p_c^\alpha q^{1-\alpha} / \int p_c^\alpha q^{1-\alpha}\,d\mu$, so that we have already proved
that (27) holds if $P$ is replaced by $P_c$. Hence, observing that
for all $R$

$$D(R\|P_c) = \begin{cases} \infty & \text{if } R \not\ll P_c, \\ D(R\|P) + \ln P(p \le cq) & \text{otherwise,} \end{cases}$$

we find that

$$\begin{aligned}
\inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}
&\le \limsup_{c\to\infty} \Big( -\alpha \ln P(p \le cq) + \inf_R \{\alpha D(R\|P_c) + (1 - \alpha) D(R\|Q)\} \Big) \\
&\le \limsup_{c\to\infty}\, (1 - \alpha) D_\alpha(P_c\|Q) \le (1 - \alpha) D_\alpha(P\|Q),
\end{aligned}$$

where the last inequality follows by lower semi-continuity of
$D_\alpha$ (Theorem 14). In case 2, (27) follows immediately. In
case 1, (27) follows by combining this inequality with its
converse (28). ■
Theorem 27 shows that $(1 - \alpha) D_\alpha(P\|Q)$ is the infimum
over a set of functions that are linear in $\alpha$, which implies the
following corollary:

Corollary 2. The function $(1 - \alpha) D_\alpha(P\|Q)$ is concave in $\alpha$
on $[0, \infty]$, with the conventions that it is 0 at $\alpha = 1$ even if
$D(P\|Q) = \infty$ and that it is 0 at $\alpha = \infty$ if $P = Q$.

Proof: Suppose first that $D(P\|Q) < \infty$. Then (27) also
holds at $\alpha = 1$. Hence $(1 - \alpha) D_\alpha(P\|Q)$ is a point-wise
infimum over linear functions on $(0, \infty)$, and thus concave.
This extends to $\alpha \in \{0, \infty\}$ by continuity.

Alternatively, suppose that $D(P\|Q) = \infty$. Then
$(1 - \alpha) D_\alpha(P\|Q)$ is still concave on $[0, 1)$, where it is also
nonnegative. And by monotonicity of Rényi divergence, we
have that $D_\alpha(P\|Q) = \infty$ for all $\alpha > 1$. Consequently,






$(1 - \alpha) D_\alpha(P\|Q)$ is nonnegative and concave for $\alpha \in [0, 1)$,
at $\alpha = 1$ it is 0 (by convention) and for $\alpha \in (1, \infty]$ it is $-\infty$.
It then follows that $(1 - \alpha) D_\alpha(P\|Q)$ is concave on all of
$[0, \infty]$, as required. ■

As one might expect from continuity of $D_\alpha(P\|Q)$, the
terms on the right-hand side of (27) are continuous in $\alpha$, at
least on $(0, 1)$:

Lemma 6. If $D(P\|Q) < \infty$ or $D(Q\|P) < \infty$, then both
$D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ are finite and continuous in $\alpha$ on
$(0, 1)$.

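Proof: The lemma is symmetric in $P$ and $Q$, so suppose
without loss of generality that $D(P\|Q) < \infty$. Then
$D_\alpha(P\|Q) \le D(P\|Q) < \infty$ implies that $P_\alpha$ is well defined
and finiteness of both $D(P_\alpha\|Q)$ and $D(P_\alpha\|P)$ follows from
Theorem 27. Now observe that

$$D(P_\alpha\|Q) = \frac{1}{\int p^\alpha q^{1-\alpha}\,d\mu}\, E_Q\Big[\Big(\frac{p}{q}\Big)^{\alpha} \ln \Big(\frac{p}{q}\Big)^{\alpha}\Big] + (1 - \alpha) D_\alpha(P\|Q).$$

Then by continuity of $D_\alpha(P\|Q)$ and hence of $\int p^\alpha q^{1-\alpha}\,d\mu$ in
$\alpha$, it is sufficient to verify continuity of $E_Q[(p/q)^\alpha \ln (p/q)^\alpha]$.
To this end, observe that

$$\big|(p/q)^\alpha \ln (p/q)^\alpha\big| \le \begin{cases} 1/e & \text{if } p \le q, \\ (p/q) \ln(p/q) & \text{if } p > q. \end{cases}$$

As $D(P\|Q) < \infty$ implies $E_Q\big[\mathbf{1}_{\{p > q\}}\, (p/q) \ln(p/q)\big] < \infty$,
we may apply the dominated convergence theorem to obtain

$$\lim_{\alpha \to \alpha^*} E_Q\Big[\Big(\frac{p}{q}\Big)^{\alpha} \ln \Big(\frac{p}{q}\Big)^{\alpha}\Big] = E_Q\Big[\Big(\frac{p}{q}\Big)^{\alpha^*} \ln \Big(\frac{p}{q}\Big)^{\alpha^*}\Big]$$

for any $\alpha^* \in (0, 1)$, which proves continuity of $D(P_\alpha\|Q)$.
Continuity of $D(P_\alpha\|P)$ now follows from Theorem 27 and
continuity of $(1 - \alpha) D_\alpha(P\|Q)$. ■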

Theorem 28. Suppose that $D(P\|Q) < \infty$. Then the following
minimax identity holds:

$$\sup_{\alpha \in (0,\infty)} \inf_R \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\} = \inf_R \sup_{\alpha \in (0,\infty)} \{\alpha D(R\|P) + (1 - \alpha) D(R\|Q)\}, \tag{29}$$

with the convention that $\alpha D(R\|P) + (1 - \alpha) D(R\|Q) = \infty$
if it would otherwise be undefined. Moreover, (29) still holds
if $\alpha$ is restricted to $(0, 1)$ on its left-hand side, and if there
exists an $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$, then
$(\alpha^*, P_{\alpha^*})$ is a saddle-point for (29) and both sides of (29) are
equal to

$$(1 - \alpha^*) D_{\alpha^*}(P\|Q) = \sup_{\alpha \in (0,1)} (1 - \alpha) D_\alpha(P\|Q) = D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q). \tag{30}$$



The minimax value defined in (29) is the Chernoff information,
which gives an asymptotically tight bound on both the
type 1 and the type 2 errors in tests of $P$ vs. $Q$. The same
connection between Chernoff information and $D(P_{\alpha^*}\|P)$ is
discussed by Cover and Thomas [27, Section 12.9], with a
different proof.
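
The saddle-point characterisation suggests a simple way to compute Chernoff information on a finite alphabet: maximise the concave function $(1-\alpha)D_\alpha(P\|Q) = -\ln \sum_x p(x)^\alpha q(x)^{1-\alpha}$ over $\alpha \in (0,1)$ and confirm that the two Kullback-Leibler divergences from the tilted distribution agree. The ternary search below is one assumed way to do this maximisation, not a method taken from the text; names and example values are hypothetical and strictly positive distributions are assumed.

```python
import math

def neg_log_hellinger(p, q, alpha):
    """(1 - alpha) * D_alpha(P||Q) = -ln sum_x p^alpha q^(1-alpha), concave in alpha."""
    return -math.log(sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q)))

def tilted(p, q, alpha):
    """Distribution P_alpha with density proportional to p^alpha * q^(1-alpha)."""
    weights = [pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(r, p):
    """Kullback-Leibler divergence D(R||P) in nats for strictly positive vectors."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p))

def chernoff(p, q, iterations=200):
    """Ternary search for the maximiser alpha* of the concave objective on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if neg_log_hellinger(p, q, m1) < neg_log_hellinger(p, q, m2):
            lo = m1
        else:
            hi = m2
    alpha_star = 0.5 * (lo + hi)
    return alpha_star, neg_log_hellinger(p, q, alpha_star)

# Hypothetical example.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
alpha_star, info = chernoff(p, q)
p_star = tilted(p, q, alpha_star)
print(alpha_star, info, kl(p_star, p), kl(p_star, q))
```

At the maximiser the derivative of the objective in $\alpha$ vanishes, which is the condition $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$ appearing in (30).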



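Proof of Theorem 28: Let $f(\alpha, R) = \alpha D(R\|P) + (1 - \alpha) D(R\|Q)$.
For $\alpha \in (0, 1)$, $D_\alpha(P\|Q) \le D(P\|Q) < \infty$
implies that $P_\alpha$ is well defined. Suppose there exists $\alpha^* \in (0, 1)$
such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$. Then Theorem 27
implies that $(\alpha^*, P_{\alpha^*})$ is a saddle-point for $f(\alpha, R)$, so that
(29) holds [37, Lemma 36.2], and Theorem 27 also implies
that all quantities in (30) are equal to $f(\alpha^*, P_{\alpha^*})$.

Let $A$ be either $(0, 1)$ or $(0, \infty)$. As the sup inf is never
bigger than the inf sup [37, Lemma 36.1], we have that

$$\sup_{\alpha \in A} \inf_R f(\alpha, R) \le \sup_{\alpha \in (0,\infty)} \inf_R f(\alpha, R) \le \inf_R \sup_{\alpha \in (0,\infty)} f(\alpha, R),$$

so it remains to prove the converse inequality.

By Lemma 6 we know that both $D(P_\alpha\|P)$ and $D(P_\alpha\|Q)$
are finite and continuous in $\alpha$ on $(0, 1)$. By the intermediate
value theorem, there are therefore three possibilities: (1) there
exists $\alpha^* \in (0, 1)$ such that $D(P_{\alpha^*}\|P) = D(P_{\alpha^*}\|Q)$,
for which we have already proved (29); (2) $D(P_\alpha\|P) < D(P_\alpha\|Q)$
for all $\alpha \in (0, 1)$; and (3) $D(P_\alpha\|P) > D(P_\alpha\|Q)$
for all $\alpha \in (0, 1)$.

We proceed with case (2), observing that

$$\begin{aligned}
\inf_R \sup_{\alpha \in (0,\infty)} f(\alpha, R)
&\le \inf_{R:\, D(R\|Q) < \infty}\, \sup_{\alpha \in (0,\infty)} f(\alpha, R) \\
&= \inf_{R:\, D(R\|Q) < \infty} \Big\{ D(R\|Q) + \sup_{\alpha \in (0,\infty)} \alpha \big(D(R\|P) - D(R\|Q)\big) \Big\} \\
&= \inf_{R:\, D(R\|P) \le D(R\|Q) < \infty} D(R\|Q) \\
&\le \inf_{0 < \alpha < 1} D(P_\alpha\|Q).
\end{aligned}$$

Now by Theorem 27

$$\begin{aligned}
\inf_{0 < \alpha < 1} D(P_\alpha\|Q) \le \liminf_{\alpha \downarrow 0} D(P_\alpha\|Q)
&= \liminf_{\alpha \downarrow 0} \Big\{ D_\alpha(P\|Q) - \frac{\alpha}{1 - \alpha}\, D(P_\alpha\|P) \Big\} \\
&\le \lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = \lim_{\alpha \downarrow 0}\, (1 - \alpha) D_\alpha(P\|Q) \\
&= \liminf_{\alpha \downarrow 0}\, \inf_R f(\alpha, R) \le \sup_{\alpha \in A} \inf_R f(\alpha, R),
\end{aligned}$$

as required. It remains to consider case (3), which turns out
to be impossible by the following argument: two applications
of Theorem 27 give

$$\begin{aligned}
D_{1/2}(P\|Q) &= \inf_{0 < \alpha < 1} \big\{ D(P_\alpha\|P) + D(P_\alpha\|Q) \big\} \\
&\le 2 \inf_{0 < \alpha < 1} D(P_\alpha\|P) \le 2 \limsup_{\alpha \uparrow 1} D(P_\alpha\|P) \\
&= 2 \limsup_{\alpha \uparrow 1} \Big\{ \frac{1 - \alpha}{\alpha}\, D_\alpha(P\|Q) - \frac{1 - \alpha}{\alpha}\, D(P_\alpha\|Q) \Big\} \\
&\le 2 \limsup_{\alpha \uparrow 1} \frac{1 - \alpha}{\alpha}\, D_\alpha(P\|Q) = 0.
\end{aligned}$$

It follows that $P = Q$, which contradicts the assumption that
$D(P_\alpha\|P) > D(P_\alpha\|Q)$ for any $\alpha \in (0, 1)$. ■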

B. Channel Capacity and Minimax Redundancy 

Consider a non-empty family $\{P_\theta \mid \theta \in \Theta\}$ of probability
distributions on a sample space $\mathcal{X}$. We may think of $\theta$ as
a parameter in a statistical model or as an input letter of
an information channel. In the main results of this section
we will only consider finite $\mathcal{X}$, with $n$ elements. Whenever
distributions on $\Theta$ are involved, we also implicitly assume
that $\Theta$ is a topological space that is equipped with the Borel
$\sigma$-algebra, and that the map $\theta \mapsto P_\theta$ is measurable.
We will study

$$C_\alpha = \sup_\pi \inf_Q \int D_\alpha(P_\theta\|Q)\,d\pi(\theta),$$

which has been proposed as the appropriate generalisation of
the channel capacity from $\alpha = 1$ to general $\alpha$ [4], [18].
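If $\mathcal{X}$ is finite, then the channel capacity is also finite:

Theorem 29. If $\mathcal{X}$ has $n$ elements, then $C_\alpha \le \ln n$ for any
$\alpha \in [0, \infty]$.

Proof: Let $U$ denote the uniform distribution on $\mathcal{X}$. Then

$$\begin{aligned}
\sup_\pi \inf_Q \int D_\alpha(P_\theta\|Q)\,d\pi(\theta)
&\le \sup_\pi \int D_\alpha(P_\theta\|U)\,d\pi(\theta) \\
&= \sup_\theta D_\alpha(P_\theta\|U) \le \sup_\theta D_\infty(P_\theta\|U) \\
&= \sup_\theta\, \ln \max_x \frac{P_\theta(x)}{1/n} \le \ln n.
\end{aligned}$$
■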

For $\alpha = 1$, it is a classical result by Gallager and Ryabko
that the channel capacity equals the minimax redundancy:

$$R_\alpha = \inf_Q \sup_{\theta \in \Theta} D_\alpha(P_\theta\|Q).$$

For finite $\Theta$, Csiszár [4] has shown that this result in fact
extends to any $\alpha \in (0, \infty)$, noting that minimax redundancy
$R_\alpha$ (and therefore the channel capacity $C_\alpha$) may be geometrically
interpreted as the "radius" of the family of distributions
$\{P_\theta \mid \theta \in \Theta\}$ with respect to the Rényi divergence of order
$\alpha$. It turns out that Csiszár's result extends to general $\Theta$ and
all orders $\alpha$:

Theorem 30. Suppose $\mathcal{X}$ is finite. Then for any $\alpha \in [0, \infty]$
the channel capacity equals the minimax redundancy:

$$C_\alpha = R_\alpha. \tag{31}$$

For $\alpha = 1$, Haussler [39] has extended this result to infinite
sample spaces $\mathcal{X}$. It seems plausible that his approach might
extend to other orders $\alpha$ as well.

Equation (31) is equivalent to the minimax identity

$$\sup_\pi \inf_Q \psi_\alpha(\pi, Q) = \inf_Q \sup_\pi \psi_\alpha(\pi, Q), \tag{32}$$

where

$$\psi_\alpha(\pi, Q) = \int D_\alpha(P_\theta\|Q)\,d\pi(\theta).$$

We will prove this identity using Sion's minimax theorem [40],
[41], which we state with its arguments exchanged to make
them line up with the arguments of $\psi_\alpha$:

Theorem 31 (Sion's Minimax Theorem). Let $A$ be a convex
subset of a linear topological space and $B$ a compact convex
subset of a linear topological space. Let $f : A \times B \to \mathbb{R}$ be
such that

(i) $f(\cdot, b)$ is upper semi-continuous and quasi-concave on $A$
for each $b \in B$;
(ii) $f(a, \cdot)$ is lower semi-continuous and quasi-convex on $B$
for each $a \in A$.

Then

$$\sup_{a \in A} \min_{b \in B} f(a, b) = \min_{b \in B} \sup_{a \in A} f(a, b).$$

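Proof of Theorem 30: Sion's minimax theorem cannot be
applied directly, because $\psi_\alpha$ may be infinite. For $\lambda \in (0, 1)$,
we therefore introduce the auxiliary function

\[ \psi_\alpha^\lambda(\pi, Q) = \psi_\alpha(\pi, (1 - \lambda) U + \lambda Q), \]

where $U$ is the uniform distribution on $\mathcal{X}$. Finiteness of $\psi_\alpha^\lambda$
follows from

\[
\begin{aligned}
D_\alpha(P_\theta \| (1 - \lambda) U + \lambda Q) &\le D_\alpha(P_\theta \| U) - \ln(1 - \lambda) \\
&\le D_\infty(P_\theta \| U) - \ln(1 - \lambda) \le \ln n - \ln(1 - \lambda),
\end{aligned}
\tag{33}
\]

where $n$ denotes the number of elements in $\mathcal{X}$.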

To verify the other conditions of Theorem 31 we observe
that $\psi_\alpha^\lambda(\cdot, Q)$ is linear, and hence continuous and concave.
Convexity of $\psi_\alpha^\lambda(\pi, \cdot)$ follows from convexity of $\psi_\alpha(\pi, \cdot)$,
which holds because $\psi_\alpha(\pi, \cdot)$ is a linear combination of convex
functions. Continuity of $\psi_\alpha^\lambda(\pi, \cdot)$ follows by the dominated
convergence theorem (which applies by (33)) and continuity
of $D_\alpha(P_\theta \| \cdot)$. Thus we may apply Sion's minimax theorem.

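By

\[ D_\alpha(P_\theta \| (1 - \lambda) U + \lambda Q) \le D_\alpha(P_\theta \| Q) - \ln \lambda, \]

we also have $\psi_\alpha^\lambda(\pi, Q) \le \psi_\alpha(\pi, Q) - \ln \lambda$, and hence we may
reason as follows:

\[
\begin{aligned}
\sup_\pi \inf_Q \psi_\alpha(\pi, Q) - \ln \lambda &\ge \sup_\pi \inf_Q \psi_\alpha^\lambda(\pi, Q) \\
&= \inf_Q \sup_\pi \psi_\alpha^\lambda(\pi, Q) \ge \inf_Q \sup_\pi \psi_\alpha(\pi, Q).
\end{aligned}
\]

By letting $\lambda$ tend to 1 we find

\[ \sup_\pi \inf_Q \psi_\alpha(\pi, Q) \ge \inf_Q \sup_\pi \psi_\alpha(\pi, Q). \]

As the sup inf never exceeds the inf sup [37, Lemma 36.1],
the converse inequality also holds, and the proof is complete.

■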

A distribution $\pi_{\mathrm{opt}}$ on the parameter space is a capacity
achieving input distribution if

\[ \inf_Q \int D_\alpha(P_\theta \| Q) \, d\pi_{\mathrm{opt}}(\theta) = C_\alpha. \]

A distribution $Q_{\mathrm{opt}}$ on $\mathcal{X}$ may be called a redundancy
achieving distribution if

\[ \sup_\theta D_\alpha(P_\theta \| Q_{\mathrm{opt}}) = R_\alpha. \]

If the sample space is finite, then a redundancy achieving
distribution always exists:

Lemma 7. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. Then
the function $Q \mapsto \sup_\theta D_\alpha(P_\theta \| Q)$ is continuous and convex,
and has at least one minimum. Consequently, a redundancy
achieving distribution $Q_{\mathrm{opt}}$ exists.

For $\alpha > 0$, the redundancy achieving distribution is in fact
unique, as will be shown by Theorem 34 below.

Proof: Denote the number of elements in $\mathcal{X}$ by $n$, let
$\Delta_n = \{(p_1, \ldots, p_n) \mid \sum_{i=1}^n p_i = 1, \; p_i \ge 0\}$ denote
the probability simplex on $n$ outcomes, and let $f(Q) =
\sup_{\theta \in \Theta} D_\alpha(P_\theta \| Q)$. Then the domain of $f$ is $\Delta_n$, and since
$f$ is the supremum over continuous, convex functions, it is
lower semi-continuous and convex itself. As convexity on a
simplex implies upper semi-continuity [37, Theorem 10.2], it
follows that $f$ is both lower and upper semi-continuous, and
is therefore continuous. As the domain of $f$ is compact, this
implies that it also attains its minimum. ■

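Theorem 32. Suppose $\mathcal{X}$ is finite and let $\alpha \in [0, \infty]$. If
there exists a (possibly non-unique) capacity achieving input
distribution $\pi_{\mathrm{opt}}$, then $\int D_\alpha(P_\theta \| Q) \, d\pi_{\mathrm{opt}}(\theta)$ is minimized by
$Q = Q_{\mathrm{opt}}$ and $D_\alpha(P_\theta \| Q_{\mathrm{opt}}) = R_\alpha$ almost surely under $\pi_{\mathrm{opt}}$.

If $R_\alpha$ is regarded as the radius of $\{P_\theta \mid \theta \in \Theta\}$, then this
theorem shows how $Q_{\mathrm{opt}}$ may be interpreted as its center.

Proof: Since $\pi_{\mathrm{opt}}$ is capacity achieving,

\[
\begin{aligned}
C_\alpha &= \inf_Q \int D_\alpha(P_\theta \| Q) \, d\pi_{\mathrm{opt}}(\theta) \\
&\le \int D_\alpha(P_\theta \| Q_{\mathrm{opt}}) \, d\pi_{\mathrm{opt}}(\theta) \\
&\le \int R_\alpha \, d\pi_{\mathrm{opt}}(\theta) = R_\alpha = C_\alpha.
\end{aligned}
\]

The result follows because both inequalities must be equalities.

■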

Three orders $\alpha$ for the channel capacity $C_\alpha$ and minimax
redundancy $R_\alpha$ are of particular interest. The classical ones
are $\alpha = 1$, because it corresponds to the original definition of
channel capacity by Shannon, and $\alpha = 0$, because $C_0$ gives an
upper bound on the zero error capacity, which also dates back
to Shannon.

Now let us look at the case $\alpha = \infty$, assuming for simplicity
that $\mathcal{X}$ is finite. We find that

\[ \sup_\theta D_\infty(P_\theta \| Q) = \sup_\theta \max_x \ln \frac{P_\theta(x)}{Q(x)} = \max_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)} \]

is the worst-case regret of $Q$ relative to $\{P_\theta \mid \theta \in \Theta\}$. It
is well known [42] that the distribution that minimizes
the worst-case regret is uniquely given by the normalized
maximum likelihood or Shtarkov distribution

\[ S(x) = \frac{\sup_\theta P_\theta(x)}{\sum_{x'} \sup_\theta P_\theta(x')}, \]

which achieves worst-case regret

\[ R_\infty = \ln \sum_x \sup_\theta P_\theta(x). \]

Thus in this case $Q_{\mathrm{opt}} = S$ is unique. Moreover, by some
algebra we can quantify the amount by which any other
distribution $Q \ne S$ exceeds the minimax redundancy, namely
by $D_\infty(S \| Q)$:



Theorem 33. Suppose $\mathcal{X}$ is finite. Then the worst-case regret
of any distribution $Q$ satisfies

\[ \sup_\theta D_\infty(P_\theta \| Q) = \sup_\theta D_\infty(P_\theta \| S) + D_\infty(S \| Q). \tag{34} \]



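Proof: We have

\[
\begin{aligned}
\max_x \ln \frac{\sup_\theta P_\theta(x)}{Q(x)}
&= \max_x \left\{ \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \ln \frac{S(x)}{Q(x)} \right\} \\
&= \ln \sum_x \sup_\theta P_\theta(x) + \max_x \ln \frac{S(x)}{Q(x)} \\
&= \max_x \ln \frac{\sup_\theta P_\theta(x)}{S(x)} + \max_x \ln \frac{S(x)}{Q(x)}.
\end{aligned}
\]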
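Identity (34) is easy to check numerically for a finite family. The sketch below is an illustration only: the three-member family, the comparison distribution `q` and all function names are hypothetical choices made here, and $\Theta$ is taken finite so that the supremum becomes a maximum.

```python
import math

def d_infinity(p, q):
    """D_infinity(p || q) = ln max_x p(x)/q(x), assuming q(x) > 0 wherever p(x) > 0."""
    return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0))

def shtarkov(family):
    """Return the normalized maximum likelihood distribution S and ln sum_x max_theta P_theta(x)."""
    ml = [max(column) for column in zip(*family)]  # pointwise maximum likelihood
    total = sum(ml)
    return [m / total for m in ml], math.log(total)

def worst_case_regret(family, q):
    """max_theta D_infinity(P_theta || q)."""
    return max(d_infinity(p, q) for p in family)

# Hypothetical finite family on a three-element sample space.
family = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5], [0.2, 0.6, 0.2]]
s, r_inf = shtarkov(family)

q = [0.5, 0.3, 0.2]  # arbitrary competitor with full support
lhs = worst_case_regret(family, q)
rhs = worst_case_regret(family, s) + d_infinity(s, q)
print(lhs, rhs)
```

For this family the pointwise maxima are (0.5, 0.6, 0.5), so the regret of `s` equals ln 1.6 and both sides of (34) equal ln 2.5 up to rounding.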



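The previous result generalises to any positive order $\alpha$ as a
one-sided inequality:

Theorem 34. Suppose $\mathcal{X}$ is finite. Then, for $\alpha \in (0, \infty]$,

\[ Q_{\mathrm{opt}} = \arg\min_Q \sup_\theta D_\alpha(P_\theta \| Q) \]

uniquely exists and for all $Q$

\[ \sup_\theta D_\alpha(P_\theta \| Q) \ge \sup_\theta D_\alpha(P_\theta \| Q_{\mathrm{opt}}) + D_\alpha(Q_{\mathrm{opt}} \| Q). \tag{35} \]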

This result, which is new, is reminiscent of Sibson's identity
[43]. It shows that any distribution $Q$ that is close to
achieving the minimax redundancy in the sense that

\[ \sup_\theta D_\alpha(P_\theta \| Q) \le \sup_\theta D_\alpha(P_\theta \| Q_{\mathrm{opt}}) + \delta, \]

must be close to $Q_{\mathrm{opt}}$ in the sense that

\[ D_\alpha(Q_{\mathrm{opt}} \| Q) \le \delta. \]

As shown in Example 3 below, Theorem 34 cannot be extended
to $\alpha = 0$. For $\alpha > 0$, we will prove it by expressing it
as a minimax identity for the function

\[ \phi_\alpha(R, Q) = \sup_\theta D_\alpha(P_\theta \| Q) - D_\alpha(R \| Q), \]

where we adopt the convention that $\phi_\alpha(R, Q) = \infty$ if both
$\sup_{\theta \in \Theta} D_\alpha(P_\theta \| Q)$ and $D_\alpha(R \| Q)$ are infinite.

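Lemma 8. Suppose $\mathcal{X}$ is finite. Then, for $\alpha \in (0, \infty]$,

\[ \max_R \min_Q \phi_\alpha(R, Q) = \min_Q \max_R \phi_\alpha(R, Q). \tag{36} \]

Moreover, $Q_{\mathrm{opt}} = \arg\min_Q \sup_{\theta \in \Theta} D_\alpha(P_\theta \| Q)$ uniquely exists
and $(R, Q) = (Q_{\mathrm{opt}}, Q_{\mathrm{opt}})$ is a saddle-point.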

Theorem 34 then follows by the following argument.

Proof of Theorem 34: The definition of a saddle-point
implies that $\phi_\alpha(Q_{\mathrm{opt}}, Q) \ge \phi_\alpha(Q_{\mathrm{opt}}, Q_{\mathrm{opt}})$ for all $Q$, from
which the theorem follows after plugging in the definition of
$\phi_\alpha$. ■

To prove Lemma 8 we will use Sion's minimax theorem
(Theorem 31) again. Verifying its conditions requires the
following lemmas.

Lemma 9. For any $\alpha \in [0, \infty]$, $\phi_\alpha(\cdot, Q)$ is quasi-concave for
any $Q$ and $\phi_\alpha(R, \cdot)$ is quasi-convex for any $R$.

Proof: Quasi-concavity in the first argument follows
because $D_\alpha(R \| Q)$ is quasi-convex in $R$. To show quasi-convexity
in the second argument, let $R$, $Q_0$, $Q_1$ and $\lambda \in (0, 1)$
be arbitrary, define $Q_\lambda = (1 - \lambda) Q_0 + \lambda Q_1$, and observe that

\[
\begin{aligned}
\phi_\alpha(R, Q_\lambda) &= \sup_\theta D_\alpha(P_\theta \| Q_\lambda) - D_\alpha(R \| Q_\lambda) \\
&\le \sup_\theta \left\{ (1 - \lambda) D_\alpha(P_\theta \| Q_0) + \lambda D_\alpha(P_\theta \| Q_1) \right\} - D_\alpha(R \| Q_\lambda) \\
&\le (1 - \lambda) \sup_\theta D_\alpha(P_\theta \| Q_0) + \lambda \sup_\theta D_\alpha(P_\theta \| Q_1) - D_\alpha(R \| Q_\lambda).
\end{aligned}
\]

As $D_\alpha(R \| Q_\lambda)$ is convex in $\lambda$ and the other terms are
linear, this upper bound is concave in $\lambda$. It follows that it is
maximized at one of the endpoints of its domain, $\lambda = 0$ and
$\lambda = 1$, where it equals $\phi_\alpha(R, Q_0)$ or $\phi_\alpha(R, Q_1)$, respectively.
We therefore find that

\[ \phi_\alpha(R, Q_\lambda) \le \max\{\phi_\alpha(R, Q_0), \phi_\alpha(R, Q_1)\}, \]

which was to be shown. ■

Lemma 10. Let $\alpha \in (0, \infty]$. Then $\phi_\alpha(\cdot, Q)$ is upper semi-continuous
for any $Q$ (in the topology of setwise convergence).

Proof: By lower semi-continuity of $D_\alpha(\cdot \| Q)$. ■

Let $\Delta_n = \{(p_1, \ldots, p_n) \mid \sum_{i=1}^n p_i = 1, \; p_i \ge 0\}$ denote
the probability simplex on $n$ outcomes, and let $\operatorname{ri}(\Delta_n) =
\{(p_1, \ldots, p_n) \mid \sum_{i=1}^n p_i = 1, \; p_i > 0\}$ denote its relative
interior in $\mathbb{R}^n$.

Lemma 11. Suppose $\mathcal{X}$ has $n$ elements. Then, for $\alpha \in [0, \infty]$
and any $R$, $\phi_\alpha(R, \cdot)$ is continuous on $\operatorname{ri}(\Delta_n)$ and upper
semi-continuous on $\Delta_n$.

Proof: The function $\sup_\theta D_\alpha(P_\theta \| \cdot)$ is continuous on $\Delta_n$
by Lemma 7 and $D_\alpha(R \| \cdot)$ is continuous by Theorem 17. It
follows that their difference is continuous as long as at least
one of the two is finite.

On $\operatorname{ri}(\Delta_n)$ both are finite, and hence $\phi_\alpha(R, \cdot)$ is continuous.
Only for a sequence $Q_1, Q_2, \ldots$ converging to a point $Q^*$
on the boundary of $\Delta_n$ may continuity break down, if both
$\sup_\theta D_\alpha(P_\theta \| Q^*)$ and $D_\alpha(R \| Q^*)$ are infinite. In this case
$\phi_\alpha(R, Q^*) = \infty$ by definition, and hence we still have upper
semi-continuity. ■

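Proof of Lemma 8: Let $n$ denote the number of elements
in $\mathcal{X}$ and, for $\varepsilon \in (0, 1/n]$, define $\Delta_n^\varepsilon = \{(p_1, \ldots, p_n) \mid
\sum_{i=1}^n p_i = 1, \; p_i \ge \varepsilon\}$. Then, on $\Delta_n \times \Delta_n^\varepsilon$, Lemmas 9 through
11 show that $\phi_\alpha(R, Q)$ is upper semi-continuous and quasi-concave
in $R$, and continuous and quasi-convex in $Q$. Thus it
satisfies the conditions of Theorem 31, so that

\[ \max_R \min_{Q \in \Delta_n^\varepsilon} \phi_\alpha(R, Q) = \min_{Q \in \Delta_n^\varepsilon} \max_R \phi_\alpha(R, Q). \tag{37} \]

(The suprema over $R$ are attained, because $\phi_\alpha(\cdot, Q)$ and hence
also $\min_{Q \in \Delta_n^\varepsilon} \phi_\alpha(\cdot, Q)$ are upper semi-continuous functions
on a compact domain.)

Let $Q_{\mathrm{opt}}^\varepsilon \in \arg\min_{Q \in \Delta_n^\varepsilon} \sup_\theta D_\alpha(P_\theta \| Q)$ be a distribution
that achieves the minimum on the right-hand side of (37). Then
for any $R \ne Q_{\mathrm{opt}}^\varepsilon$ we have

\[
\min_{Q \in \Delta_n^\varepsilon} \phi_\alpha(R, Q) \le \phi_\alpha(R, Q_{\mathrm{opt}}^\varepsilon)
< \sup_\theta D_\alpha(P_\theta \| Q_{\mathrm{opt}}^\varepsilon)
= \min_{Q \in \Delta_n^\varepsilon} \max_R \phi_\alpha(R, Q),
\]

so only $R = Q_{\mathrm{opt}}^\varepsilon$ can achieve the maximum on the left-hand
side of (37). It follows that $(R, Q) = (Q_{\mathrm{opt}}^\varepsilon, Q_{\mathrm{opt}}^\varepsilon)$ is a saddle-point,
and that $Q_{\mathrm{opt}}^\varepsilon$ is unique.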

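Let $\varepsilon_1 > \varepsilon_2 > \cdots > 0$ be a decreasing sequence that
converges to 0. Then $Q_{\mathrm{opt}}^{\varepsilon_1}, Q_{\mathrm{opt}}^{\varepsilon_2}, \ldots$ is an infinite sequence
in a compact domain, and hence (by the Bolzano-Weierstrass
theorem) there is a subsequence $\varepsilon'_1 > \varepsilon'_2 > \cdots > 0$ such that
$Q_{\mathrm{opt}}^{\varepsilon'_1}, Q_{\mathrm{opt}}^{\varepsilon'_2}, \ldots$ converges to some $Q^* \in \Delta_n$.

Now let $Q \in \operatorname{ri}(\Delta_n)$ be arbitrary. Then upper semi-continuity
of $\phi_\alpha(\cdot, Q)$ implies that

\[
\begin{aligned}
\phi_\alpha(Q^*, Q) &\ge \limsup_{m \to \infty} \phi_\alpha(Q_{\mathrm{opt}}^{\varepsilon'_m}, Q) \\
&\ge \limsup_{m \to \infty} \min_{Q \in \Delta_n^{\varepsilon'_m}} \phi_\alpha(Q_{\mathrm{opt}}^{\varepsilon'_m}, Q) \\
&= \limsup_{m \to \infty} \min_{Q \in \Delta_n^{\varepsilon'_m}} \max_R \phi_\alpha(R, Q) \\
&\ge \inf_{Q \in \Delta_n} \max_R \phi_\alpha(R, Q).
\end{aligned}
\]

Together with upper semi-continuity of $\phi_\alpha(Q^*, \cdot)$ on $\Delta_n$ (see
Lemma 11) this implies that also for any $Q$ on the boundary
of $\Delta_n$

\[
\phi_\alpha(Q^*, Q) \ge \limsup_{\lambda \uparrow 1} \phi_\alpha(Q^*, (1 - \lambda) U + \lambda Q)
\ge \inf_{Q \in \Delta_n} \max_R \phi_\alpha(R, Q),
\]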



where $U = (1/n, \ldots, 1/n)$ is the uniform distribution. Thus

\[
\max_R \inf_{Q \in \Delta_n} \phi_\alpha(R, Q) \ge \inf_{Q \in \Delta_n} \phi_\alpha(Q^*, Q)
\ge \inf_{Q \in \Delta_n} \max_R \phi_\alpha(R, Q).
\tag{38}
\]

Since the max inf never exceeds the inf max [37,
Lemma 36.1], these inequalities must in fact hold with
equality.

It remains to establish that $Q_{\mathrm{opt}}$ uniquely exists and that
$(R, Q) = (Q_{\mathrm{opt}}, Q_{\mathrm{opt}})$ is a saddle-point. By Lemma 7 we
know that there exists a distribution $Q'$ that minimizes
$\sup_\theta D_\alpha(P_\theta \| \cdot)$. Suppose $Q' \ne Q^*$. Then $\phi_\alpha(Q^*, Q') <
\min_Q \sup_\theta D_\alpha(P_\theta \| Q)$, which would contradict (38). Hence
$Q_{\mathrm{opt}} = Q' = Q^*$ is unique.

Moreover, (38) implies that

\[ \phi_\alpha(Q_{\mathrm{opt}}, Q) \ge \min_Q \max_R \phi_\alpha(R, Q) = \phi_\alpha(Q_{\mathrm{opt}}, Q_{\mathrm{opt}}) \]

for all $Q$, and it may be directly verified that $\phi_\alpha(R, Q_{\mathrm{opt}}) \le
\phi_\alpha(Q_{\mathrm{opt}}, Q_{\mathrm{opt}})$ for all $R$. Thus $(Q_{\mathrm{opt}}, Q_{\mathrm{opt}})$ is a saddle-point,
which concludes the proof. ■
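A distribution $\pi$ on the parameter space is called a
barycentric input distribution if

\[ Q_{\mathrm{opt}} = \int P_\theta \, d\pi(\theta). \]

Example 3. Take $\alpha \in (0, \infty]$ and consider the distributions

\[ P_1 = \left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right), \qquad P_2 = \left(0, \tfrac{1}{2}, \tfrac{1}{2}\right) \]

on a three-element set. Then by symmetry the unique redundancy
achieving distribution has the form

\[ Q_{\mathrm{opt}} = (q, q, 1 - 2q). \]

If $\alpha$ is a simple order, then for $\theta \in \{1, 2\}$ the divergence is

\[
\begin{aligned}
D_\alpha(P_\theta \| Q_{\mathrm{opt}})
&= \frac{1}{\alpha - 1} \ln \left( \left(\tfrac{1}{2}\right)^\alpha q^{1-\alpha} + \left(\tfrac{1}{2}\right)^\alpha (1 - 2q)^{1-\alpha} \right) \\
&= \frac{\alpha \ln 2}{1 - \alpha} + \frac{1}{\alpha - 1} \ln \left( q^{1-\alpha} + (1 - 2q)^{1-\alpha} \right).
\end{aligned}
\]

To find $q$, we therefore have to extremize

\[ f(q) = q^{1-\alpha} + (1 - 2q)^{1-\alpha}, \]

which leads to

\[ q = \frac{1}{2 + 2^{1/\alpha}}. \tag{39} \]

The reader may verify that (39) also holds for $\alpha = 1$, giving
$Q_{\mathrm{opt}} = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2})$, and for $\alpha = \infty$, giving $Q_{\mathrm{opt}} = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$.
Note that only for $\alpha = 1$ is $Q_{\mathrm{opt}}$ a convex combination of $P_1$
and $P_2$, with unique barycentric input distribution $\pi = (\tfrac{1}{2}, \tfrac{1}{2})$.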

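Finally, consider $\alpha = 0$, for which Theorem 34 does not
apply. In this case (39) still holds, giving $Q_{\mathrm{opt}} = (0, 0, 1)$.
Now let $Q = (\tfrac{1}{2}, \tfrac{1}{2}, 0)$. Then, for $\theta \in \{1, 2\}$, we see that the
first two terms in (35) are well-behaved:

\[
\begin{aligned}
\lim_{\alpha \downarrow 0} \sup_\theta D_\alpha(P_\theta \| Q) &= \sup_\theta D_0(P_\theta \| Q) = \ln 2, \\
\lim_{\alpha \downarrow 0} \sup_\theta D_\alpha(P_\theta \| Q_{\mathrm{opt}}) &= 0 = \sup_\theta D_0(P_\theta \| Q_{\mathrm{opt}}).
\end{aligned}
\]

For the last term, however, $\lim_{\alpha \downarrow 0} D_\alpha(Q_{\mathrm{opt}} \| Q) = \ln 2$,
whereas $D_0(Q_{\mathrm{opt}} \| Q) = \infty$, and so we obtain a counterexample
to (35).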
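The closed form $q = 1/(2 + 2^{1/\alpha})$ for Example 3 can be compared with a brute-force minimization, and inequality (35) can be spot-checked for one competitor $Q$. The Python sketch below is an illustration under assumptions: it takes $P_1 = (1/2, 0, 1/2)$ and $P_2 = (0, 1/2, 1/2)$, covers simple orders only, and the function names, grid resolution and test distribution are hypothetical choices made here.

```python
import math

P1 = (0.5, 0.0, 0.5)
P2 = (0.0, 0.5, 0.5)

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) in nats for a simple order alpha and strictly positive q."""
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

def worst_case(q_vec, alpha):
    """sup over the two-element family of D_alpha(P_theta || q_vec)."""
    return max(renyi_divergence(p, q_vec, alpha) for p in (P1, P2))

def q_closed_form(alpha):
    """Stationary point of f(q) = q^(1-alpha) + (1-2q)^(1-alpha)."""
    return 1.0 / (2.0 + 2.0 ** (1.0 / alpha))

def q_grid_search(alpha, steps=20000):
    """Minimize the worst-case divergence to (q, q, 1-2q) over a grid in (0, 1/2)."""
    candidates = (k / (2 * steps) for k in range(1, steps))
    return min(candidates, key=lambda q: worst_case((q, q, 1 - 2 * q), alpha))

def gap_in_35(q_vec, alpha):
    """Left-hand side minus right-hand side of the one-sided inequality."""
    q = q_closed_form(alpha)
    q_opt = (q, q, 1 - 2 * q)
    return (worst_case(q_vec, alpha) - worst_case(q_opt, alpha)
            - renyi_divergence(q_opt, q_vec, alpha))

for alpha in (0.5, 2.0):
    print(alpha, q_closed_form(alpha), q_grid_search(alpha), gap_in_35((0.5, 0.3, 0.2), alpha))
```

A nonnegative value of `gap_in_35` is what the theorem predicts for $\alpha > 0$; the grid search only confirms the closed form up to the grid spacing.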

Example 4. Let $\theta \in [0, 1]$ denote the success probability of
a binomial distribution $P_\theta = \mathrm{Bin}(2, \theta)$ on $\mathcal{X} = \{0, 1, 2\}$.
Then for $\alpha = \infty$ the redundancy achieving distribution is
$S = (\tfrac{2}{5}, \tfrac{1}{5}, \tfrac{2}{5})$ and the minimax redundancy is $R_\infty = \ln \tfrac{5}{2}$.

In this case there are many barycentric input distributions.
For example, the distribution $\pi = \tfrac{1}{5} M_0 + \tfrac{3}{5} U + \tfrac{1}{5} M_1$ is a
barycentric input distribution, where $M_\theta$ is a point-mass on $\theta$
and $U$ is the uniform distribution on $[0, 1]$. Another example
is the distribution $\pi = (\tfrac{3}{10}, \tfrac{2}{5}, \tfrac{3}{10})$ on the maximum likelihood
parameters $\hat{\Theta} = \{0, \tfrac{1}{2}, 1\}$ for the elements of $\mathcal{X}$.

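If there exists a capacity achieving input distribution $\pi_{\mathrm{opt}}$,
then by Theorem 32 it must be such that

\[ D_\infty(P_\theta \| S) = R_\infty \tag{40} \]

almost surely under $\pi_{\mathrm{opt}}$. The only $\theta$ that satisfy (40) are the
maximum likelihood parameters in the set $\hat{\Theta}$ defined above,
and hence $\pi_{\mathrm{opt}}$ must be supported on $\hat{\Theta}$. Using positivity of
Kullback-Leibler divergence (Theorem 8), it can be shown that
the infimum in

\[
\inf_Q \sum_{\theta \in \hat{\Theta}} D_\infty(P_\theta \| Q) \, \pi_{\mathrm{opt}}(\theta)
= \inf_Q \left\{ \pi_{\mathrm{opt}}(0) \ln \frac{1}{Q(0)} + \pi_{\mathrm{opt}}(\tfrac{1}{2}) \ln \frac{1/2}{Q(1)} + \pi_{\mathrm{opt}}(1) \ln \frac{1}{Q(2)} \right\}
\tag{41}
\]

is uniquely achieved by $Q = (\pi_{\mathrm{opt}}(0), \pi_{\mathrm{opt}}(\tfrac{1}{2}), \pi_{\mathrm{opt}}(1))$. If $\pi_{\mathrm{opt}}$
is to be the capacity achieving input distribution, then this $Q$
must equal $S$ by Theorem 32 and hence $\pi_{\mathrm{opt}} = (\tfrac{2}{5}, \tfrac{1}{5}, \tfrac{2}{5})$ on $\hat{\Theta}$.
Evaluating (41) for this choice of $\pi_{\mathrm{opt}}$, we indeed find that
it equals $\ln \tfrac{5}{2} = R_\infty = C_\infty$ as required, and thus $\pi_{\mathrm{opt}}$ is the
unique capacity achieving input distribution.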
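The numbers in Example 4 can be recomputed exactly with rational arithmetic. The sketch below is an illustration under stated assumptions: the weights $(\tfrac{3}{10}, \tfrac{2}{5}, \tfrac{3}{10})$ and $(\tfrac{1}{5}, \tfrac{3}{5}, \tfrac{1}{5})$ are the values obtained here by solving the barycentre condition for $\mathrm{Bin}(2, \theta)$, the uniform prior is replaced by its known mixture $(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$, and all function names are hypothetical.

```python
from fractions import Fraction as F
import math

def binom2(theta):
    """Bin(2, theta) as a probability vector on {0, 1, 2}."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def mixture(weights, dists):
    """Weighted average of probability vectors on three outcomes."""
    return [sum(w * d[x] for w, d in zip(weights, dists)) for x in range(3)]

def d_infinity(p, q):
    """D_infinity(p || q) = ln max_x p(x)/q(x) for strictly positive q."""
    return math.log(max(pi / qi for pi, qi in zip(p, q)))

ml_params = [F(0), F(1, 2), F(1)]

# Shtarkov distribution: maximize each likelihood over theta, then normalize.
ml = [F(1), F(1, 2), F(1)]  # maxima attained at theta = 0, 1/2, 1
S = [m / sum(ml) for m in ml]
R_inf = math.log(sum(ml))

# Barycentric mixture supported on the maximum likelihood parameters.
discrete = mixture([F(3, 10), F(2, 5), F(3, 10)], [binom2(t) for t in ml_params])

# Point masses at 0 and 1 plus the uniform prior on [0, 1],
# whose mixture of Bin(2, theta) is (1/3, 1/3, 1/3).
uniform_mixture = [F(1, 3)] * 3
continuous = mixture([F(1, 5), F(3, 5), F(1, 5)],
                     [binom2(F(0)), uniform_mixture, binom2(F(1))])

# Average divergence to S under the candidate capacity achieving weights pi_opt = S.
capacity = sum(float(w) * d_infinity(binom2(t), S) for w, t in zip(S, ml_params))
print(S, discrete == S, continuous == S, capacity, R_inf)
```

Both mixtures reproduce $S$ exactly, and the average divergence under the weights $(\tfrac{2}{5}, \tfrac{1}{5}, \tfrac{2}{5})$ matches $\ln \tfrac{5}{2}$ up to rounding.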

V. Negative Orders 

Until now we have only discussed Rényi divergence of nonnegative
orders. However, using the formula that defines Rényi divergence
of simple orders for $\alpha \in (-\infty, 0)$ (reading $q^{1-\alpha} / p^{-\alpha}$
for $p^\alpha q^{1-\alpha}$), it may also be defined for these
negative orders. This definition extends to $\alpha = -\infty$ by

\[ D_{-\infty}(P \| Q) = \lim_{\alpha \downarrow -\infty} D_\alpha(P \| Q). \]

According to Rényi [1], only positive orders can be regarded
as measures of information, and negative orders indeed seem to
be hardly used in applications. Nevertheless, for completeness
we will also study Rényi divergence of negative orders. As
will be seen below, our results for positive orders carry over
to the negative orders, but most properties are reversed. People
may have avoided negative orders because of these reversed
properties. Avoiding negative orders is always possible, because
they are related to orders $\alpha > 1$ by an extension of
skew symmetry:
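Lemma 12 (Skew Symmetry). For any $\alpha \in (-\infty, \infty)$, $\alpha \notin \{0, 1\}$

\[ D_\alpha(P \| Q) = \frac{\alpha}{1 - \alpha} D_{1-\alpha}(Q \| P). \tag{42} \]

Furthermore

\[ D_{-\infty}(P \| Q) = -D_\infty(Q \| P) = \ln \inf_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \ln \operatorname*{ess\,inf}_Q \frac{p}{q}, \]

with the conventions that $0/0 = 0$ and $x/0 = \infty$ for $x > 0$.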

Proof: The identity (42) follows directly from definitions.
It implies $D_{-\infty}(P \| Q) = -D_\infty(Q \| P)$, because $\frac{\alpha}{1 - \alpha}$ tends to
$-1$ as $\alpha \to -\infty$. The remaining identities follow from the
closed-form expressions for $D_\infty(Q \| P)$ in Theorem 6. ■

Skew symmetry gives a kind of symmetry between the
orders $\tfrac{1}{2} + \alpha$ and $\tfrac{1}{2} - \alpha$. In applications in physics this
symmetry is related to the use of so-called escort probabilities
[44].
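For distributions with full support on a finite set, skew symmetry (42) can be verified directly, including at negative orders. The sketch below is an illustration only: the two test distributions, the list of orders and the function names are assumptions made here, and zero probabilities are deliberately avoided so that no conventions for division by zero are needed.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) in nats for alpha not equal to 1 and strictly positive p, q."""
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def skew_symmetric_side(p, q, alpha):
    """Right-hand side of (42): alpha / (1 - alpha) * D_{1-alpha}(q || p)."""
    return alpha / (1 - alpha) * renyi_divergence(q, p, 1 - alpha)

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
orders = [-5.0, -1.0, -0.5, 0.25, 0.5, 3.0]  # none of these equals 0 or 1

gaps = [abs(renyi_divergence(p, q, a) - skew_symmetric_side(p, q, a)) for a in orders]
negative_order_values = [renyi_divergence(p, q, a) for a in orders if a < 0]
print(max(gaps), negative_order_values)
```

The negative-order values come out nonpositive, in line with the sign reversal discussed for negative orders.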

Whereas the nonnegative orders generally satisfy the same
or similar properties for different values of $\alpha$, the fact that
$\frac{\alpha}{1 - \alpha} < 0$ for $\alpha < 0$ implies that properties for negative
orders are often inverted. For example, Rényi divergence for
negative orders is nonpositive, concave in its first argument and
upper semi-continuous in the topology of setwise convergence.
In addition, the data processing inequality holds with its
inequality reversed and for $\alpha \in (-\infty, 0)$ Theorem 2 applies
with an infimum instead of a supremum.

Not all properties are inverted, however. Most notably, it
does remain true that Rényi divergence is nondecreasing and
continuous in $\alpha$:

Theorem 35. For $\alpha \in [-\infty, \infty]$, the Rényi divergence
$D_\alpha(P \| Q)$ is nondecreasing in $\alpha$.

Proof: For $\alpha < 0$, $D_\alpha(P \| Q) \le 0$ and for $\alpha \ge 0$,
$D_\alpha(P \| Q) \ge 0$, so the divergence for negative orders never
exceeds the divergence for nonnegative orders. The remainder
of the proof follows from Theorem 3 and skew symmetry. ■
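Monotonicity across negative and nonnegative orders can be observed on a small example. The following sketch is illustrative only: the distributions and the grid of orders are arbitrary choices made here, both distributions have full support, and the order 1 is handled by the Kullback-Leibler formula.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) for strictly positive p, q; alpha = 1 gives Kullback-Leibler divergence."""
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
orders = [-10.0, -2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 10.0]  # increasing grid of orders
values = [renyi_divergence(p, q, a) for a in orders]

# Each value should not exceed the next one, up to rounding error.
is_nondecreasing = all(x <= y + 1e-12 for x, y in zip(values, values[1:]))
print(is_nondecreasing, values)
```

On this grid the values are negative for the negative orders, approximately zero at order 0 (because both distributions have full support), and positive afterwards.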

Theorem 36. The Rényi divergence $D_\alpha(P \| Q)$ is continuous
in $\alpha$ on

\[ \mathcal{A} = \{\alpha \in [-\infty, \infty] \mid 0 \le \alpha \le 1 \text{ or } |D_\alpha(P \| Q)| < \infty\}. \]

Proof: Rényi divergence is nondecreasing in $\alpha$, nonnegative
for $\alpha \ge 0$ and nonpositive for $\alpha < 0$. Therefore the
required continuity follows directly from Theorem 7 and skew
symmetry, except for the case

\[ \lim_{\alpha \uparrow 0} D_\alpha(P \| Q) = D_0(P \| Q), \]

which is required to hold if there exists a value $\beta < 0$
such that $D_\beta(P \| Q) > -\infty$. In this case $D_{1-\beta}(Q \| P) =
\frac{1 - \beta}{\beta} D_\beta(P \| Q) < \infty$, which implies: (a) that $Q \ll P$, so
$D_0(P \| Q) = 0$; and (b) that $D(Q \| P) < \infty$ and by Theorem 5

\[ \lim_{\alpha \uparrow 0} D_\alpha(P \| Q) = \lim_{\alpha \uparrow 0} \frac{\alpha}{1 - \alpha} D_{1-\alpha}(Q \| P) = 0 \cdot D(Q \| P) = 0. \]



VI. Summary 

We have reviewed and derived the most important properties
of Rényi divergence and Kullback-Leibler divergence.
These include convexity and continuity properties, limits of
$\sigma$-algebras, additivity for product distributions on infinite
sequences, and the relation of the special order 0 to absolute
continuity and mutual singularity of such distributions.

We have also derived several key minimax identities. In
particular, Theorems 27 and 28 illuminate the relation between
Rényi divergence, Kullback-Leibler divergence and Chernoff
information in hypothesis testing. And Theorem 30 extends
the known equivalence of channel capacity and minimax redundancy
(for all orders) to continuous channel inputs. A new
result relating the worst-case redundancy of a distribution to its
divergence from the (unique) minimax redundancy achieving
distribution was given by Theorem 34.

Appendix: Negative Results

Some useful properties that are satisfied by other divergences
are not satisfied by Rényi divergence. Here we give
counterexamples for a few important ones.



A. No Pythagorean Inequality 

An important result in statistical applications of information
theory is the Pythagorean inequality for Kullback-Leibler
divergence [27], [45], [46]. It states that, if $\mathcal{P}$ is a convex
set of distributions, $Q$ is any distribution not in $\mathcal{P}$, and
$D_{\min} = \inf_{P \in \mathcal{P}} D(P \| Q)$, then there exists a distribution $P^*$
such that

\[ D(P \| Q) \ge D(P \| P^*) + D_{\min} \quad \text{for all } P \in \mathcal{P}. \]

The main use of the Pythagorean inequality lies in its implication
that if $P_1, P_2, \ldots$ is a sequence of distributions in $\mathcal{P}$
such that $D(P_n \| Q) \to D_{\min}$, then $P_n$ converges to $P^*$ in the
strong sense that $D(P_n \| P^*) \to 0$.

Unfortunately, for $\alpha \ne 1$ Rényi divergence does not satisfy
the Pythagorean inequality, as demonstrated by the counterexamples
below. We should point to results by Sundaresan
[47], however, who argues that, under regularity conditions,
for finite sample spaces a generalisation of Rényi divergence
(see [48]) does satisfy a modified Pythagorean inequality, in
which every distribution $R \in \{P, Q\}$ is replaced by its tilted
counterpart

\[ R'(x) = \frac{R(x)^\alpha}{\sum_y R(y)^\alpha}. \]

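To construct the counterexamples for the ordinary
Pythagorean inequality, first consider $\alpha \in [0, 1)$. Let
$Q = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ be uniform on three points and let $\mathcal{P} =
\{(p_1, p_2, p_3) \mid p_1 = \tfrac{1}{4}\}$ be the convex set of distributions
with first component fixed at $\tfrac{1}{4}$. Then $\inf_{P \in \mathcal{P}} D_\alpha(P \| Q)$ is
achieved by $P^* = (\tfrac{1}{4}, \tfrac{3}{8}, \tfrac{3}{8})$ and the Pythagorean inequality

\[ D_\alpha(P \| Q) \ge D_\alpha(P \| P^*) + D_\alpha(P^* \| Q) \tag{43} \]

is violated for $P = (\tfrac{1}{4}, 0, \tfrac{3}{4})$: if $\alpha > 0$, then (43) is equivalent
to

\[
\begin{aligned}
1 + 3^\alpha &\le \tfrac{1}{4} \left(1 + 3 \cdot 2^{\alpha - 1}\right) \left(1 + 2^{1-\alpha} 3^\alpha\right) \\
(1 - 2^{1-\alpha})(2 \cdot 3^\alpha - 3 \cdot 2^\alpha) &\le 0,
\end{aligned}
\]

which is false. If $\alpha = 0$, then $D_\alpha(P \| Q) = -\ln \tfrac{2}{3} <
D_\alpha(P \| P^*) = -\ln \tfrac{5}{8}$ and $D_\alpha(P^* \| Q) = 0$, and the inequality
does not hold either.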

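Secondly, for $\alpha \in (1, \infty]$ take $Q = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ and $\mathcal{P} =
\{(p_1, p_2, p_3) \mid p_1 = \tfrac{2}{3}\}$. Then $\inf_{P \in \mathcal{P}} D_\alpha(P \| Q)$ is achieved
by $P^* = (\tfrac{2}{3}, \tfrac{1}{6}, \tfrac{1}{6})$ and the inequality is violated for $P =
(\tfrac{2}{3}, 0, \tfrac{1}{3})$: if $\alpha < \infty$, then (43) is equivalent to

\[
\begin{aligned}
6 (1 + 2^\alpha) &\ge (4 + 2^\alpha)(2^{1-\alpha} + 2^\alpha) \\
(2^\alpha - 2)(4^\alpha - 4) &\le 0,
\end{aligned}
\]

which is false. If $\alpha = \infty$, then $D_\alpha(P \| Q) = D_\alpha(P \| P^*) =
D_\alpha(P^* \| Q) = \ln 2$ and the inequality does not hold either.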
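Both counterexamples can be confirmed by direct evaluation. The sketch below is an illustration under assumptions: it uses the uniform $Q$ on three points with $P = (\tfrac{1}{4}, 0, \tfrac{3}{4})$, $P^* = (\tfrac{1}{4}, \tfrac{3}{8}, \tfrac{3}{8})$ at the order $\alpha = \tfrac{1}{2}$, and $P = (\tfrac{2}{3}, 0, \tfrac{1}{3})$, $P^* = (\tfrac{2}{3}, \tfrac{1}{6}, \tfrac{1}{6})$ at the order $\alpha = 2$; the two orders and the function names are choices made here.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) in nats for a simple order alpha and strictly positive q."""
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

def pythagorean_gap(p, p_star, q, alpha):
    """D(p||q) - D(p||p_star) - D(p_star||q); negative when the inequality is violated."""
    return (renyi_divergence(p, q, alpha)
            - renyi_divergence(p, p_star, alpha)
            - renyi_divergence(p_star, q, alpha))

uniform = (1 / 3, 1 / 3, 1 / 3)

# Order below 1, first component fixed at 1/4.
gap_low = pythagorean_gap((1 / 4, 0, 3 / 4), (1 / 4, 3 / 8, 3 / 8), uniform, 0.5)

# Order above 1, first component fixed at 2/3.
gap_high = pythagorean_gap((2 / 3, 0, 1 / 3), (2 / 3, 1 / 6, 1 / 6), uniform, 2.0)

print(gap_low, gap_high)
```

Both gaps come out negative, so the Pythagorean inequality fails in each case.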
B. Convexity in P does not hold for $\alpha > 1$

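Rényi divergence for $\alpha \in (1, \infty)$ is not convex in its
first argument. Consider the following counterexample: let
$0 < p_0 < p_1 < 1$ be any two numbers, and let $p_{1/2} = \frac{p_0 + p_1}{2}$.
Let $\varepsilon > 0$ be arbitrary, and let $0 < q < 1$ be small enough
that

\[ \max_{i \in \{0, 1\}} \frac{(1 - p_i)^\alpha (1 - q)^{1-\alpha}}{p_i^\alpha q^{1-\alpha}} \le \varepsilon. \]

Then convexity of $D_\alpha$ in its first argument would imply that

\[
\begin{aligned}
&\tfrac{1}{2} \ln \left( p_0^\alpha q^{1-\alpha} + (1 - p_0)^\alpha (1 - q)^{1-\alpha} \right)
+ \tfrac{1}{2} \ln \left( p_1^\alpha q^{1-\alpha} + (1 - p_1)^\alpha (1 - q)^{1-\alpha} \right) \\
&\qquad \ge \ln \left( p_{1/2}^\alpha q^{1-\alpha} + (1 - p_{1/2})^\alpha (1 - q)^{1-\alpha} \right),
\end{aligned}
\]

which implies

\[
\begin{aligned}
\tfrac{1}{2} \ln \left( p_0^\alpha q^{1-\alpha} (1 + \varepsilon) \right) + \tfrac{1}{2} \ln \left( p_1^\alpha q^{1-\alpha} (1 + \varepsilon) \right) &\ge \ln \left( p_{1/2}^\alpha q^{1-\alpha} \right) \\
\tfrac{1}{2} \ln \left( p_0^\alpha (1 + \varepsilon) \right) + \tfrac{1}{2} \ln \left( p_1^\alpha (1 + \varepsilon) \right) &\ge \ln \left( p_{1/2}^\alpha \right).
\end{aligned}
\]

As this expression holds for all $\varepsilon > 0$, we get

\[
\begin{aligned}
\tfrac{1}{2} \ln p_0^\alpha + \tfrac{1}{2} \ln p_1^\alpha &\ge \ln p_{1/2}^\alpha \\
\tfrac{1}{2} \ln p_0 + \tfrac{1}{2} \ln p_1 &\ge \ln p_{1/2},
\end{aligned}
\]

which is a contradiction, because the natural logarithm is
strictly concave.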
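A concrete instance makes the failure of convexity visible. The sketch below is only an illustration: the order $\alpha = 2$, the values $p_0 = 0.1$, $p_1 = 0.9$ and $q = 0.001$, and the function name `renyi_binary` are choices made here for distributions on two points.

```python
import math

def renyi_binary(p, q, alpha):
    """D_alpha between Bernoulli(p) and Bernoulli(q), for alpha > 1 and 0 < p, q < 1."""
    s = p**alpha * q**(1 - alpha) + (1 - p)**alpha * (1 - q)**(1 - alpha)
    return math.log(s) / (alpha - 1)

alpha, q = 2.0, 0.001   # small q makes the first term dominate
p0, p1 = 0.1, 0.9
p_half = (p0 + p1) / 2

# Convexity in the first argument would require midpoint <= average.
average = 0.5 * renyi_binary(p0, q, alpha) + 0.5 * renyi_binary(p1, q, alpha)
midpoint = renyi_binary(p_half, q, alpha)
print(midpoint, average)
```

Here the divergence at the midpoint exceeds the average of the divergences at the endpoints, which is incompatible with convexity.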

C. Renyi divergence is not continuous 

In general the Rényi divergence of order $\alpha \in (0, 1)$ is
not continuous in the topology of setwise convergence. To
construct a counterexample, recall that $\tau = 2\pi$, let $P_n$ denote
the probability distribution on $[0, \tau]$ with density $\frac{1 + \sin(nx)}{\tau}$, and
let $Q_n$ denote the probability distribution on $[0, \tau]$ with density
$\frac{1 - \sin(nx)}{\tau}$ for $n = 1, 2, \ldots$ Then $D_\alpha(P_n \| Q_n) > 0$ does not
depend on $n$, and both $P_n$ and $Q_n$ converge to the uniform
distribution $U$ on $[0, \tau]$ in the topology of setwise convergence.
Consequently, $\lim_{n \to \infty} D_\alpha(P_n \| Q_n) \ne 0 = D_\alpha(U \| U)$, so
in general $D_\alpha$ is not continuous in the topology of setwise
convergence.
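That $D_\alpha(P_n \| Q_n)$ does not depend on $n$ can be observed by numerical integration. The sketch below is an illustration under assumptions: it uses a midpoint rule with a fixed number of steps, the order $\alpha = \tfrac{1}{2}$, and densities $(1 \pm \sin(nx))/\tau$; for that order the integrand reduces to $|\cos(nx)|/\tau$, so the divergence equals $2 \ln(\pi/2)$. Function names are hypothetical.

```python
import math

TAU = 2 * math.pi

def hellinger_integral(n, alpha, steps=100_000):
    """Midpoint-rule approximation of the integral of p_n^alpha * q_n^(1-alpha) over [0, tau]."""
    h = TAU / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        p = (1 + math.sin(n * x)) / TAU   # density of P_n
        q = (1 - math.sin(n * x)) / TAU   # density of Q_n
        total += p**alpha * q**(1 - alpha) * h
    return total

def renyi_divergence(n, alpha):
    """D_alpha(P_n || Q_n) for alpha in (0, 1), computed from the integral above."""
    return math.log(hellinger_integral(n, alpha)) / (alpha - 1)

values = [renyi_divergence(n, 0.5) for n in (1, 2, 7)]
print(values, 2 * math.log(math.pi / 2))
```

The three values agree with each other and with the closed form for order $\tfrac{1}{2}$ up to the discretization error, and they stay bounded away from $D_\alpha(U \| U) = 0$.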

D. Not a metric 

Except for the order $\alpha = \tfrac{1}{2}$, Rényi divergence is not symmetric
and cannot be a metric. For $\alpha = \tfrac{1}{2}$ Rényi divergence
is symmetric and by (O it locally behaves like the square 
of a metric. Therefore one may wonder whether it actually 
is the square of a metric itself. Consider the following three 
distributions on two points: 



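\[ P = (0, 1); \qquad Q = \left(\tfrac{1}{2}, \tfrac{1}{2}\right); \qquad R = (1, 0). \]

Then

\[ D_{1/2}(P \| Q) = \ln 2, \qquad D_{1/2}(Q \| R) = \ln 2, \qquad D_{1/2}(P \| R) = \infty. \]

As the square roots of these divergences violate the triangle
inequality, $D_{1/2}$ cannot be the square of a metric.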
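The three divergences in this counterexample follow from the formula $D_{1/2}(P \| Q) = -2 \ln \sum_x \sqrt{P(x) Q(x)}$. The sketch below is an illustration only; the function name `renyi_half` is an assumption made here.

```python
import math

def renyi_half(p, q):
    """D_{1/2}(p || q) = -2 ln sum_x sqrt(p(x) q(x)); infinite for disjoint supports."""
    affinity = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.inf if affinity == 0 else -2 * math.log(affinity)

P, Q, R = (0.0, 1.0), (0.5, 0.5), (1.0, 0.0)
d_pq, d_qr, d_pr = renyi_half(P, Q), renyi_half(Q, R), renyi_half(P, R)

# A squared metric would need sqrt(d_pr) <= sqrt(d_pq) + sqrt(d_qr).
triangle_holds = math.sqrt(d_pr) <= math.sqrt(d_pq) + math.sqrt(d_qr)
print(d_pq, d_qr, d_pr, triangle_holds)
```

Since $P$ and $R$ have disjoint supports, their divergence is infinite while the two intermediate steps are finite, so the triangle inequality for the square roots fails.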